
Best Practices: Table of Contents





Best Practices BP-2
Configuration Management BP-2
Database Sizing BP-2
Migration Procedures BP-5
Development Techniques BP-36
Data Cleansing BP-36
Data Connectivity using PowerCenter Connect for BW Integration Server BP-42
Data Connectivity using PowerCenter Connect for MQSeries BP-47
Data Connectivity using PowerCenter Connect for SAP BP-52
Data Profiling BP-60
Data Quality Mapping Rules BP-63
Deployment Groups BP-69
Designing Analytic Data Architectures BP-72
Developing an Integration Competency Center BP-82
Development FAQs BP-90
Key Management in Data Warehousing Solutions BP-99
Mapping Design BP-103
Mapping Templates BP-107
Naming Conventions BP-110
Performing Incremental Loads BP-120
Real-Time Integration with PowerCenter BP-126
Session and Data Partitioning BP-136
Using Parameters, Variables and Parameter Files BP-141
Using PowerCenter Labels BP-157
Using PowerCenter Metadata Reporter and Metadata Exchange Views for Quality Assurance BP-162
Using PowerCenter with UDB BP-164
Using Shortcut Keys in PowerCenter Designer BP-170

Web Services BP-178
Working with PowerCenter Connect for MQSeries BP-183
Error Handling BP-190
A Mapping Approach to Trapping Data Errors BP-190
Error Handling Strategies BP-194
Error Handling Techniques using PowerCenter 7 (PC7) and PowerCenter Metadata Reporter (PCMR) BP-205
Error Management in a Data Warehouse Environment BP-212
Error Management Process Flow BP-220
Metadata and Object Management BP-223
Creating Inventories of Reusable Objects & Mappings BP-223
Metadata Reporting and Sharing BP-227
Repository Tables & Metadata Management BP-239
Using Metadata Extensions BP-247
Operations BP-250
Daily Operations BP-250
Data Integration Load Traceability BP-252
Event Based Scheduling BP-259
High Availability BP-262
Load Validation BP-265
Repository Administration BP-270
SuperGlue Repository Administration BP-273
Third Party Scheduler BP-278
Updating Repository Statistics BP-282
PowerAnalyzer Configuration and Performance Tuning BP-288
Deploying PowerAnalyzer Objects BP-288
Installing PowerAnalyzer BP-299
PowerAnalyzer Security BP-306
Tuning and Configuring PowerAnalyzer and PowerAnalyzer Reports BP-314
Upgrading PowerAnalyzer BP-332
PowerCenter Configuration and Performance Tuning BP-334
Advanced Client Configuration Options BP-334
Advanced Server Configuration Options BP-339
Causes and Analysis of UNIX Core Files BP-344
Determining Bottlenecks BP-347
Managing Repository Size BP-352
Organizing and Maintaining Parameter Files & Variables BP-354
Performance Tuning Databases (Oracle) BP-359
Performance Tuning Databases (SQL Server) BP-371

Performance Tuning Databases (Teradata) BP-377
Performance Tuning UNIX Systems BP-380
Performance Tuning Windows NT/2000 Systems BP-387
Platform Sizing BP-390
Recommended Performance Tuning Procedures BP-393
Tuning Mappings for Better Performance BP-396
Tuning Sessions for Better Performance BP-409
Tuning SQL Overrides and Environment for Better Performance BP-417
Understanding and Setting UNIX Resources for PowerCenter Installations BP-431
Upgrading PowerCenter BP-435
Project Management BP-441
Assessing the Business Case BP-441
Defining and Prioritizing Requirements BP-444
Developing a Work Breakdown Structure (WBS) BP-447
Developing and Maintaining the Project Plan BP-449
Developing the Business Case BP-451
Managing the Project Lifecycle BP-453
Using Interviews to Determine Corporate Analytics Requirements BP-456
PWX Configuration & Tuning BP-463
PowerExchange Installation (for Mainframe) BP-463
Recovery BP-469
Running Sessions in Recovery Mode BP-469
Security BP-476
Configuring Security BP-476
SuperGlue BP-491
Custom XConnect Implementation BP-491
Customizing the SuperGlue Interface BP-496
Estimating SuperGlue Volume Requirements BP-503
SuperGlue Metadata Load Validation BP-506
SuperGlue Performance & Tuning BP-510
Using SuperGlue Console to Tune the XConnects BP-510

Database Sizing
Challenge
Database sizing involves estimating the types and sizes of the components of a data
architecture. This is important for determining the optimal configuration for your
database servers in order to support your operational workloads. Individuals involved in
a sizing exercise may be data architects, database administrators, and/or business
analysts.
Description
The first step in database sizing is to review system requirements to define such things
as:
expected data architecture elements (will there be staging areas? operational data
stores? centralized data warehouse and/or master data? data marts?)
expected source data volume
data granularity and periodicity
load frequency and method (full refresh? incremental updates?)
estimated growth rates over time and retained history

Determining Growth Projections
One way to estimate projections of data growth over time is to use scenario analysis.
As an example, for scenario analysis of a sales tracking data mart you can use the
number of sales transactions to be stored as the basis for the sizing estimate. In the
first year, 10 million sales transactions are expected; this equates to 10 million fact
table records.
Next, use the sales growth forecasts for the upcoming years for database growth
calculations. That is, an annual sales growth rate of 10 percent translates into 11
million fact table records for the next year. At the end of five years, the fact table is
likely to contain about 60 million records. You may want to calculate other estimates
based on five-percent annual sales growth (case 1) and 20-percent annual sales growth
(case 2). Multiple projections for best and worst case scenarios can be very helpful.
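To see how the five-year total is derived, sum the yearly volumes at 10-percent growth: 10.0 + 11.0 + 12.1 + 13.3 + 14.6 million, or roughly 61 million fact table records. The same arithmetic at 5-percent growth yields about 55 million records, and at 20-percent growth about 74 million, bounding the best- and worst-case scenarios.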
Baseline Volumetric

Next, use the physical data models for the sources and the target architecture to
develop a baseline sizing estimate. The administration guides for most DBMSs contain
sizing guidelines for the various database structures such as tables, indexes, sort
space, data files, log files, and database cache.
Develop a detailed sizing using a worksheet inventory of the tables and indexes from
the physical data model along with field data types and field sizes. Various database
products use different storage methods for data types. For this reason, be sure to use
the database manuals to determine the size of each data type. Add up the field sizes to
determine the row size. Then use the data volume projections to determine the number of
rows, and multiply the number of rows by the row size to estimate the table size.
The default estimate for index size is to assume it is the same as the table size. Also
estimate the temporary space for sort operations. For data warehouse applications
where summarizations are common, plan on large temporary spaces. The temporary
space can be as much as 1.5 times larger than the largest table in the database.
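As a simple illustration (the figures below are hypothetical): if the fields of a fact table add up to a 100-byte row and the volume projection calls for 10 million rows, the table is roughly 1GB. Applying the defaults above adds approximately another 1GB for indexes and about 1.5GB of temporary sort space (1.5 times the largest table), before any block, overhead, or free-space factors from the DBMS administration guide are applied.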
Another approach that is sometimes useful is to load the data architecture with
representative data and determine the resulting database sizes. This test load can be a
fraction of the actual data and is used only to gather basic sizing statistics. You will
then need to apply growth projections to these statistics. For example, after loading ten
thousand sample records to the fact table, you determine the size to be 10MB. Based
on the scenario analysis, you can expect this fact table to contain 60 million records
after five years. So, the estimated size for the fact table is about 60GB [i.e., 10 MB *
(60,000,000/10,000)]. Don't forget to add indexes and summary tables to the
calculations.
Guesstimating
When there is not enough information to calculate an estimate as described above, use
educated guesses and rules of thumb to develop as reasonable an estimate as
possible.
If you don't have the source data model, use what you do know of the source data
to estimate average field size and average number of fields in a row to
determine table size. Based on your understanding of transaction volume over
time, determine your growth metrics for each type of data and calculate your
source data volume (SDV) from table size and growth metrics.
If your target data architecture is not complete enough to determine table
sizes, base your estimates on multiples of the SDV:
o If it includes staging areas: add another SDV for any source subject area
that you will stage, multiplied by the number of loads you'll retain in
staging.
o If you intend to consolidate data into an operational data store, add the
SDV multiplied by the number of loads to be retained in the ODS for
historical purposes (e.g., keeping 1 year's worth of monthly loads = 12 x
SDV).
o If it includes a data warehouse: based on the periodicity and granularity of
the DW, add another SDV + (0.3n x SDV, where n = the number of
time periods loaded into the warehouse over time).
o If your data architecture includes aggregates, add a percentage of the
warehouse volumetrics based on how much of the warehouse data will be
aggregated and to what level (e.g., if the rollup level represents 10
percent of the dimensions at the detail level, use 10 percent).
o Similarly, for data marts add a percentage of the data warehouse based on
how much of the warehouse data is moved into the data mart.
o Be sure to consider the growth projections over time and the history to be
retained in all of your calculations.
And finally, remember that there is always much more data than you expect, so you
may want to add a reasonable fudge factor to the calculations as a margin of safety.
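As a hedged illustration of these rules of thumb, assume a hypothetical 5GB SDV, three loads retained in staging, 12 monthly loads retained in the ODS, a warehouse holding 36 monthly periods, aggregates at 10 percent of the warehouse, and a data mart holding 25 percent of the warehouse. The estimate works out to: staging = 3 x 5GB = 15GB; ODS = 12 x 5GB = 60GB; warehouse = 5GB + (0.3 x 36 x 5GB) = 59GB; aggregates = 0.10 x 59GB, or about 6GB; data mart = 0.25 x 59GB, or about 15GB; roughly 155GB in total before the fudge factor.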


Migration Procedures
Challenge
Develop a migration strategy that ensures clean migration between development, test,
QA, and production environments, thereby protecting the integrity of each of these
environments as the system evolves.
Description
Ensuring that an application has a smooth migration process between development,
quality assurance (QA), and production environments is essential for the deployment of
an application. Deciding which migration strategy works best for a project depends on
several factors.
1. How is the PowerCenter repository environment designed? Are there individual
repositories for development, QA, and production, or are there just one or two
environments that share one or all of these phases?
2. How has the folder architecture been defined?
Each of these factors plays a role in determining the migration procedure that is most
beneficial to the project.
Informatica PowerCenter offers flexible migration options that can be adapted to fit the
need of each application. PowerCenter migration options include repository migration,
folder migration, object migration, and XML import/export. In versioned PowerCenter
repositories, users can also use static or dynamic deployment groups for migration,
which provides the capability to migrate any combination of objects within the
repository with a single command.
This Best Practice document is intended to help the development team decide which
technique is most appropriate for the project. The following sections discuss various
options that are available, based on the environment and architecture selected. Each
section describes the major advantages of its use as well as its disadvantages.
REPOSITORY ENVIRONMENTS
The following section outlines the migration procedures for standalone and distributed
repository environments. The distributed environment section touches on several
migration architectures, outlining the pros and cons of each. Also, please note that any
methods described in the Standalone section may also be used in a Distributed
environment.
STANDALONE REPOSITORY ENVIRONMENT

In a standalone environment, all work is performed in a single PowerCenter repository
that serves as the metadata store. Separate folders are used to represent the
development, QA, and production workspaces and segregate work. This type of
architecture within a single repository ensures seamless migration from development to
QA, and from QA to production.
The following example shows a typical architecture. In this example, the company has
chosen to create separate development folders for each of the individual developers for
development and unit test purposes. A single shared or common development folder,
SHARED_MARKETING_DEV, holds all of the common objects, such as sources, targets,
and reusable mapplets. In addition, two test folders are created for QA purposes. The
first contains all of the unit-tested mappings from the development folder. The second
is a common or shared folder that contains all of the tested shared objects. Eventually,
as the following paragraphs explain, two production folders will also be built.

Proposed Migration Process: Single Repository
DEV to TEST: Object-Level Migration
Now that we've described the repository architecture for this organization, let's discuss
how it will migrate mappings to test, and then eventually to production.
After all mappings have completed their unit testing, the process for migration to test
can begin. The first step in this process is to copy all of the shared or common objects
from the SHARED_MARKETING_DEV folder to the SHARED_MARKETING_TEST folder.
This can be done using one of two methods:
The first, and most common, method is object migration via an object copy. In
this case, a user opens the SHARED_MARKETING_TEST folder and drags the
object from the SHARED_MARKETING_DEV folder into the appropriate workspace
(e.g., Source Analyzer or Warehouse Designer). This is similar to dragging a file
from one folder to another using Windows Explorer.

The second approach is object migration via object XML import/export. A user can
export each of the objects in the SHARED_MARKETING_DEV folder to XML, and
then re-import each object into the SHARED_MARKETING_TEST via XML import.
With the XML import/export, the XML files can be uploaded to a third party
versioning tool, if your organization has standardized on such a tool. Otherwise,
versioning can be enabled in PowerCenter. Migrations with versioned
PowerCenter repositories will be covered later in this document.
After you've copied all common or shared objects, the next step is to copy the
individual mappings from each development folder into the MARKETING_TEST folder.
Again, you can use either of the two object-level migration methods described above to
copy the mappings to the folder, although the XML import/export method is the most
intuitive method for resolving shared object conflicts. However, the migration method
is slightly different here when you're copying the mappings because you must ensure
that the shortcuts in the mapping are associated with the SHARED_MARKETING_TEST
folder. Designer will prompt you to choose the correct shortcut folder that you
created in the previous example, which points to the SHARED_MARKETING_TEST folder (see
image below). You can then continue the migration process until all mappings have
been successfully migrated. In PowerCenter 7, you can export multiple objects into a
single XML file, and then also import them at the same time.

The final step in the process is to migrate the workflows that use those mappings.
Again, the object-level migration can be completed either through drag-and-drop or by
using XML import/export. In either case, this process is very similar to the steps
described above for migrating mappings, but differs in that the Workflow Manager
provides a Workflow Copy Wizard to step you through the process. The following steps
outline the full process for successfully copying a workflow and all of its associated
tasks.
1. The wizard prompts for the name of the new workflow. If a workflow with the
same name exists in the destination folder, the wizard prompts you to rename it
or replace it. If no such workflow exists, a default name will be used. Then click
Next to continue the copy process.
2. The wizard then checks whether each task already exists in the destination
folder (as shown below). If the task is present, you can rename or replace the
current one. If it does not exist, the default name is used (see below). Then click Next.


3. Next, the wizard prompts you to select the mapping associated with each
session task in the workflow. Select the mapping and continue by clicking
Next.

4. If connections exist in the target repository, the wizard will prompt you to select
the connection to use for the source and target. If no connections exist, the
default settings will be used. When this step is completed, click Finish and save
the work.
Initial Migration: New Folders Created

The initial move to production is very different from subsequent changes to mappings
and workflows. Since the repository only contains folders for
development and test, we need to create two new folders to house the production-
ready objects. You will create these folders after testing of the objects in
SHARED_MARKETING_TEST and MARKETING_TEST has been approved.
The following steps outline the creation of the production folders and, at the same time,
address the initial test to production migration.
1. Open the PowerCenter Repository Manager client tool and log into the repository
2. To make a shared folder for the production environment, highlight the
SHARED_MARKETING_TEST folder, drag it, and drop it on the repository name.
The Copy Folder Wizard will appear and step you through the copying process

The first wizard screen asks whether you want to use the typical folder copy options or the
advanced options. In this example, you will be using the advanced options.



The second wizard screen prompts the user to enter a folder name. By default, the
folder name that appears on this screen is the folder name followed by the date. In this
case, enter the name as SHARED_MARKETING_PROD.

The third wizard screen prompts you to select a folder to override. Because this is
the first time you are transporting the folder, you won't need to select anything.



The final screen begins the actual copy process. Click Finish when it is complete.
Repeat this process to create the MARKETING_PROD folder. Use the MARKETING_TEST
folder as the original to copy and associate the shared objects with the
SHARED_MARKETING_PROD folder that was just created.
At the end of the migration, you should have two additional folders in the repository
environment for production: SHARED_MARKETING_PROD and MARKETING_PROD (as
shown below). These folders contain the initially migrated objects. Before you can
actually run the workflow in these production folders, you need to modify the session
source and target connections to point to the production environment.

Incremental Migration: Object Copy Example
Now that the initial production migration is complete, let's take a look at how future
changes will be migrated into the folder.

Any time an object is modified, it must be re-tested and migrated into production for
the actual change to occur. These types of changes in production take place on a case-
by-case or periodically scheduled basis. The following steps outline the process of
moving these objects individually.
1. Log into PowerCenter Designer. Open the destination folder and expand the
source folder. Click on the object to copy and drag-and-drop it into the
appropriate workspace window.
2. Because this is a modification to an object that already exists in the destination
folder, Designer will prompt you to choose whether to Rename or Replace the
object (as shown below). Choose the option to replace the object.

3. Beginning with PowerCenter 7, you can choose to compare conflicts whenever
migrating any object in Designer or Workflow Manager. By comparing the
objects, you can ensure that the changes that you are making are what you
intended. See below for an example of the mapping compare window.


4. After the object has been successfully copied, save the folder so the changes can
take place.
5. The newly copied mapping is now tied to any sessions the replaced mapping was
tied to.
6. Log into Workflow Manager and make the appropriate changes to the session or
workflow so that it picks up the changes.
Standalone Repository Example
This example looks at moving development work to the QA phase and then
from QA to production. It uses multiple development folders, one for each
developer, with the test and production folders divided by the data mart they
represent. The example focuses solely on the MARKETING_DEV data mart, first
explaining how to move objects and mappings from each individual folder to the test
folder, and then how to move tasks, worklets, and workflows to the new area.
Follow these steps to copy a mapping from Development to QA:
1. If using shortcuts, first follow these steps; if not using shortcuts, skip to step 2
o Copy the tested objects from the SHARED_MARKETING_DEV folder to the
SHARED_MARKETING_TEST folder.
o Drag all of the newly copied objects from the SHARED_MARKETING_TEST
folder to MARKETING_TEST.
o Save your changes.
2. Copy the mapping from Development into Test.

o In the PowerCenter Designer, open the MARKETING_TEST folder, and drag
and drop the mapping from each development folder into the
MARKETING_TEST folder.
o When copying each mapping, PowerCenter Designer will prompt you to
Replace, Rename, Reuse, or Skip each reusable object, such as source
and target definitions. Choose to Reuse the object for all shared objects
in the mappings copied into the MARKETING_TEST folder.
o Save your changes.
3. If a reusable session task is being used, follow these steps. Otherwise, skip to
step 4.
o In the PowerCenter Workflow Manager, open the MARKETING_TEST folder,
and drag and drop each reusable session from the developers' folders
into the MARKETING_TEST folder. A Copy Session Wizard will step you
through the copying process.
o Open each newly copied session and click on the Source tab. Change the
source to point to the source database for the Test environment.
o Click the Target tab. Change each connection to point to the target
database for the Test environment. Be sure to double-check the
workspace from within the Target tab to ensure that the load options are
correct.
o Save your changes.
4. While the MARKETING_TEST folder is still open, copy each workflow from
Development to Test.
o Drag each workflow from the development folders into the
MARKETING_TEST folder. The Copy Workflow Wizard will appear. Follow
the same steps listed above to copy the workflow to the new folder.
o As mentioned above, the copy wizard now allows conflicts to be compared
from within Workflow Manager to ensure that the correct migrations are
being made.
o Save your changes.
5. Implement the appropriate security.
o In Development, the owner of the folders should be a user(s) in the
development group.
o In Test, change the owner of the Test folder to a user(s) in the Test group.
o In Production, change the owner of the folders to a user in the Production
group.
o Revoke all rights to Public other than Read for the Production folders.
Disadvantages of a Single Repository Environment
The most significant disadvantage of a single repository environment is performance.
Having a development, QA, and production environment within a single repository can
cause degradation in production performance as the production environment shares
CPU and memory resources with the development and test environments. Although
these environments are stored in separate folders, they all reside within the same
database table space and on the same server.
For example, if development or test loads are running simultaneously with production
loads, the server machine may reach 100 percent utilization and production
performance will suffer.

A single repository structure can also create more confusion, as the same users and
groups exist in all environments and the number of folders can increase
exponentially.
DISTRIBUTED REPOSITORY ENVIRONMENT

A distributed repository environment maintains separate, independent repositories,
hardware, and software for development, test, and production environments.
Separating repository environments is preferable for handling development to
production migrations. Because the environments are segregated from one another,
work performed in development cannot impact QA or production.
With a fully distributed approach, separate repositories function much like the separate
folders in a standalone environment. Each repository has a similar name, like the
folders in the standalone environment. For instance, in our Marketing example we
would have three repositories, INFADEV, INFATEST, and INFAPROD. In the following
example, we discuss a distributed repository architecture.
There are four techniques for migrating from development to production in a distributed
repository architecture, with each involving some advantages and disadvantages. In the
following pages, we discuss each of the migration options:
Repository Copy
Folder Copy
Object Copy
Deployment Groups

Repository Copy
So far, this document has covered object-level migrations and folder migrations
through drag-and-drop object copying and through object XML import/export. This
section of the document will cover migrations in a distributed repository environment
through repository copies.


The main advantages of this approach are:
The ability to copy all objects (mappings, workflows, mapplets, reusable
transformations, etc.) at once from one environment to another.
The ability to automate this process using pmrep commands. This eliminates much
of the manual processes that users typically perform.
Everything can be moved without breaking or corrupting any of the objects.
This approach also involves a few disadvantages.
The first is that everything is moved at once (which is also an advantage). The
problem with this is that everything is moved, ready or not. For example, we
may have 50 mappings in QA, but only 40 of them are production-ready. The 10
untested mappings are moved into production along with the 40 production-
ready mappings.
This leads to the second disadvantage, the maintenance required to remove any
unwanted or excess objects.
Another disadvantage is the need to adjust server variables, sequences,
parameters/variables, database connections, etc. Everything must be set up
correctly before the actual production runs can take place.
Lastly, the repository copy process requires that the existing Production repository
be deleted, and then the Test repository can be copied. This results in a loss of
production environment operational metadata such as load statuses, session run
times, etc. High performance organizations leverage the value of operational
metadata to track trends over time related to load success/failure and duration.
This metadata can be a competitive advantage for organizations that use this
information to plan for future growth.
Now that we've discussed the advantages and disadvantages, we will look at three
ways to accomplish the Repository Copy method:
Copying the Repository
Repository Backup and Restore
PMREP
Copying the Repository
Copying the Test repository to Production through the GUI client tools is the easiest of
all the migration methods. The task is very simple. First, ensure that all users are
logged out of the destination repository, then open the PowerCenter Repository
Administration Console client tool (as shown below).


1. If the Production repository already exists, you must delete the repository before
you can copy the Test repository. Before you can delete the repository, you
must stop it. Right-click the Production Repository and choose Stop. You can
delete the Production repository by selecting it and choosing Delete from the
context menu. You will want to actually delete the repository, not just remove it
from the server cache.
2. Now, create the Production repository connection by highlighting the
Repositories folder in the Navigator view and choosing New Repository. Enter
the connection information for the Production repository. Make sure to choose
the Do not create any content option.


3. Right-click the Production Repository and choose All Tasks -> Copy from.


4. In the dialog window, choose the name of the Test repository from the drop
down menu. Enter the username and password of the Test repository.


5. Click Ok and the copy process will begin.
6. When you've successfully copied the repository to the new location, exit from
the Repository Server Administration Console.
7. In the Repository Manager, double click on the newly copied repository and log
in with a valid username and password.
8. Verify connectivity, then highlight each folder individually and rename them. For
example, rename the MARKETING_TEST folder to MARKETING_PROD, and the
SHARED_MARKETING_TEST to SHARED_MARKETING_PROD.
9. Be sure to remove all objects that are not pertinent to the Production
environment from the folders before beginning the actual testing process.
10. When this cleanup is finished, you can log into the repository through the
Workflow Manager. Modify the server information and all connections so they
are updated to point to the new Production locations for all existing tasks and
workflows.
Repository Backup and Restore
Backup and Restore Repository is another simple method of copying an entire
repository. This process backs up the repository to a binary file that can be restored to
any new location. This method is preferable to the repository copy process because if
any type of error occurs, the file is backed up to the binary file on the repository server.
The following steps outline the process of backing up and restoring the repository for
migration.
1. Launch the PowerCenter Repository Server Administration Console client,
connect to the repository server, and highlight the Test repository.
2. Select Action -> All Tasks -> Backup from the menu. A screen will appear and
prompt you to supply a name for the backup file as well as the Administrator
username and password. The file will be saved to the Backup directory within
the repository server's home directory.


3. After you've selected the location and file name, click the OK button and the
backup process will begin.
The backup process will create a .rep file containing all repository information. Stay
logged into the Manage Repositories screen. When the backup is complete, select the
repository connection to which the backup will be restored (the Production repository), or
create the connection if it does not already exist. Follow these steps to complete the
repository restore:
1. Right-click the destination repository and choose All Tasks -> Restore.


2. The system will prompt you to supply a username, password, and the name of
the file to be restored. Enter the appropriate information and click the Restore
button.
When the restoration process is complete, you must repeat the steps listed in the copy
repository option in order to delete all of the unused objects and rename the
folders.
PMREP
Using the PMREP commands is essentially the same as the Backup and Restore
Repository method except that it is run from the command line rather than through the
GUI client tools. PMREP utilities can be used from the Informatica Server or from any
client machine connected to the server.
Refer to the Repository Manager Guide for a list of PMREP commands.
The following is a sample of the command syntax used within a Windows batch file to
connect to and back up a repository. Using the code example below as a model, you can
write scripts to be run on a daily basis to perform functions such as connect, backup,
restore, etc.:
backupproduction.bat
@echo off
REM This batch file uses pmrep to connect to and back up the repository Production on the server Central
echo Connecting to Production repository...
"C:\Program Files\Informatica PowerCenter 7.1.1\RepositoryServer\bin\pmrep" connect -r INFAPROD -n Administrator -x Adminpwd -h infarepserver -o 7001
echo Backing up Production repository...
"C:\Program Files\Informatica PowerCenter 7.1.1\RepositoryServer\bin\pmrep" backup -o c:\backup\Production_backup.rep
Post-Repository Migration Cleanup
After you have used one of the repository migration procedures described above to
migrate into Production, follow these steps to convert the repository to Production:
1. Disable workflows that are not ready for Production or simply delete the
mappings, tasks, and workflows.
o Disable the workflows not being used in the Workflow Manager by opening
the workflow properties, and then checking the Disabled checkbox under
the General tab.
o Delete the tasks not being used in the Workflow Manager and the mappings
in the Designer.
2. Modify the database connection strings to point to the production sources and
targets.
o In the Workflow Manager, select Relational connections from the
Connections menu.
o Edit each relational connection by changing the connect string to point to
the production sources and targets.
o If using lookup transformations in the mappings and the connect string is
anything other than $SOURCE or $TARGET, then the connect string will
need to be modified appropriately.
3. Modify the pre- and post-session commands and SQL as necessary.
o In the Workflow Manager, open the session task properties, and from the
Components tab make the required changes to the pre- and post-session
scripts.
4. Implement appropriate security, such as:
o In Development, ensure that the owner of the folders is a user in the
development group.
o In Test, change the owner of the test folders to a user in the test group.
o In Production, change the owner of the folders to a user in the production
group.
o Revoke all rights to Public other than Read for the Production folders.
FOLDER COPY

Although deployment groups are becoming a very popular migration method, the folder
copy method has historically been the most popular way to migrate in a distributed
environment. Copying an entire folder allows you to quickly promote all of the objects
located within that folder. All source and target objects, reusable transformations,
mapplets, mappings, tasks, worklets and workflows are promoted at once. Because of
this, however, everything in the folder must be ready to migrate forward. If some
mappings or workflows are not valid, then developers (or the Repository Administrator)
must manually delete these mappings or workflows from the new folder after the folder
is copied.
The following examples step through a sample folder copy process using three separate
repositories (one each for Development, Test, and Production) and using two
repositories (one for development and test, one for production).
The three advantages of using the folder copy method are:
The Repository Manager's Folder Copy Wizard makes it almost seamless to copy an
entire folder and all the objects located within it.
If the project uses a common or shared folder and this folder is copied first, then
all shortcut relationships are automatically converted to point to this newly
copied common or shared folder.
All connections, sequences, mapping variables, and workflow variables are copied
automatically.
The primary disadvantage of the folder copy method is that the repository is locked
while the folder copy is being performed. Therefore, it is necessary to schedule this
migration task during a time when the repository is least utilized. Please keep in mind
that a locked repository means that NO jobs can be launched during this process. This
can be a serious consideration in real-time or near real-time environments.
The following example steps through the process of copying folders from each of the
different environments. The first example uses three separate repositories for
development, test, and production.
1. If using shortcuts, follow these sub steps; otherwise skip to step 2:
o Open the Repository Manager client tool.
o Connect to both the Development and Test repositories.
o Highlight the folder to copy and drag it to the Test repository.
o The Copy Folder Wizard will appear and step you through the copy process.
o When the folder copy process is complete, open the newly copied folder in
both the Repository Manager and Designer to ensure that the objects
were copied properly.
2. Copy the Development folder to Test.
If you skipped step 1, follow these sub-steps:
o Open the Repository Manager client tool.
o Connect to both the Development and Test repositories.
o Highlight the folder to copy and drag it to the Test repository.
o The Copy Folder Wizard will appear.



Follow these steps to ensure that all shortcuts are reconnected.
o Use the advanced options when copying the folder across.
o Select Next to use the default name of the folder.
o If the folder already exists in the destination repository, choose to replace the
folder.

The following screen will appear prompting you to select the folder where the new
shortcuts are located.


In a situation where the folder names do not match, a folder compare will take
place. The Copy Folder wizard will then complete the folder copy process.
Rename the folder to the appropriate name and implement the appropriate security.
3. When testing is complete, repeat the steps above to migrate to the Production
repository.
When the folder copy process is complete, log onto the Workflow Manager and change
the connections to point to the appropriate target location. Ensure that all tasks were
updated correctly and that folder and repository security is modified for test and
production.
Object Copy
Copying mappings into the next stage in a networked environment involves many of
the same advantages and disadvantages as in the standalone environment, but the
process of handling shortcuts is simplified in the networked environment. For additional
information, see the earlier description of Object Copy for the standalone environment.
One advantage of Object Copy in a distributed environment is that it provides more
granular control over objects.
Two distinct disadvantages of Object Copy in a distributed environment are:
Much more work to deploy an entire group of objects
Shortcuts must exist prior to importing/copying mappings
Below are the steps to complete an object copy in a distributed repository environment:
1. If using shortcuts, follow these sub-steps, otherwise skip to step 2:
o In each of the distributed repositories, create a common folder with the
exact same name and case.

o Copy the shortcuts into the common folder in Production, making sure the
shortcut has the exact same name.
2. Copy the mapping from the Test environment into Production.
o In the Designer, connect to both the Test and Production repositories and
open the appropriate folders in each.
o Drag-and-drop the mapping from Test into Production.
o During the mapping copy, PowerCenter 7 allows a comparison of this
mapping to an existing copy of the mapping already in Production. Also,
in PowerCenter 7, the ability to compare objects is not limited to
mappings, but is available for all repository objects including workflows,
sessions, and tasks.
3. Create or copy a workflow with the corresponding session task in the Workflow
Manager to run the mapping (first ensure that the mapping exists in the current
repository).
o If copying the workflow, follow the Copy Wizard.
o If creating the workflow, add a session task that points to the mapping and
enter all the appropriate information.
4. Implement appropriate security.
o In Development, ensure the owner of the folders is a user in the
development group.
o In Test, change the owner of the test folders to a user in the test group.
o In Production, change the owner of the folders to a user in the production
group.
o Revoke all rights to Public other than Read for the Production folders.
Deployment Groups
For versioned repositories, the use of Deployment Groups for migrations between
distributed environments allows the most flexibility and convenience. With Deployment
Groups, you have the flexibility of migrating individual objects, as in an
object copy migration, but also the convenience of a repository- or folder-level
migration, as all objects are deployed at once. The objects included in a deployment
group have no restrictions and can come from one or multiple folders. Additionally, you
can set up a dynamic deployment group, which allows the objects in the
deployment group to be defined by a repository query rather than being added
manually, providing additional convenience. Lastly, since deployment groups are
available on versioned repositories, they can also be rolled back, reverting the
objects to their previous versions when necessary.
Creating a Deployment Group
Below are the steps to create a deployment group:
1. Launch the Repository Manager client tool and log in to the source repository.
2. Expand the repository, right-click on Deployment Groups and choose New
Group.


3. In the dialog window, give the deployment group a name, and choose whether
it should be static or dynamic. In this example, we are creating a static
deployment group. Choose OK.

Adding Objects to a Static Deployment Group
Below are the steps to add objects to a static deployment group:

1. In Designer, Workflow Manager, or Repository Manager, right-click an object that
you want to add to the deployment group and choose Versioning -> View
History. The View History window will be displayed.

2. In the View History window, right-click the object and choose Add to
Deployment Group.


3. In the Deployment Group dialog window, choose the deployment group that you
want to add the object to, and choose OK.

4. In the final dialog window, choose whether you want to add dependent objects.
In most cases, you will want to add dependent objects to the deployment group
so that they will be migrated as well. Choose OK.


NOTE: The All Dependencies option should be used for any new code that is
migrating forward. However, it can cause issues when moving existing code
forward because All Dependencies also flags shortcuts. During the
deployment, Informatica will try to re-insert or replace the shortcuts, which does
not work and will cause your deployment to fail.
5. The object will be added to the deployment group at this time.
Although the deployment group allows the most flexibility at this time, the task of
adding each object to the deployment group is similar to the effort required for an
object copy migration. To make deployment groups easier to use, PowerCenter provides
the capability to create dynamic deployment groups.
Adding Objects to a Dynamic Deployment Group
Dynamic Deployment groups are similar to static deployment groups in their function,
but differ based on how objects are added to the deployment group. In a static
deployment group, objects are manually added to the deployment group one by one.
In a dynamic deployment group, the contents of the deployment group are defined by a
repository query. Don't worry about the complexity of writing a repository query; it is
quite simple and aided by the PowerCenter GUI interface.
Below are the steps to add objects to a dynamic deployment group:
1. First, create a deployment group, just as you did for a static deployment group,
but in this case, choose the dynamic option. Also, select the Queries button.


2. The Query Browser window is displayed. Choose New to create a query for
the dynamic deployment group.

3. In the Query Editor window, provide a name and query type (Shared). Define
criteria for the objects that should be migrated. The drop down list of
parameters allows a user to choose from 23 predefined metadata categories. In
this case, the developers have assigned the RELEASE_20050130 label to all
objects that need to be migrated, so the query is defined as Label Is Equal To
RELEASE_20050130. The creation and application of labels are discussed in a
separate Velocity Best Practice.


4. Save the Query and exit the Query Editor. Choose OK on the Query Browser
window, and choose OK on the Deployment Group editor window.
Executing a Deployment Group Migration
A Deployment Group migration can be executed through the Repository Manager client
tool, or through the pmrep command line utility. In the client tool, a user simply drags
the deployment group from the source repository and drops it on the destination
repository. This prompts the Copy Deployment Group wizard which will walk a user
through the step-by-step options for executing the deployment group.
Rolling back a Deployment
To roll back a deployment, locate the deployment in the TARGET repository via the
menu bar: Deployments -> History -> View History -> Rollback button.
Automated Deployments
For the optimal migration method, users can set up a UNIX shell or Windows batch
script that calls the pmrep DeployDeploymentGroup command, which will execute a
deployment group migration without human interaction. This is ideal because the
deployment group allows ultimate flexibility and convenience, and the script can be
scheduled to run overnight, having minimal impact on developers and the PowerCenter
administrator. You can also use the pmrep utility to automate importing objects via
XML.
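The following is a minimal sketch of such a script, modeled on the backupproduction.bat example shown earlier. The deployment group name, the control file, and the option letters passed to DeployDeploymentGroup are illustrative assumptions only; verify the exact pmrep syntax for your PowerCenter version in the command line reference before using it.

deploymarketing.bat
@echo off
REM Connect to the source (Test) repository using the same connect syntax as the backup example
"C:\Program Files\Informatica PowerCenter 7.1.1\RepositoryServer\bin\pmrep" connect -r INFATEST -n Administrator -x Adminpwd -h infarepserver -o 7001
REM Deploy the group to Production; the group name, control file, and flags below are assumptions
"C:\Program Files\Informatica PowerCenter 7.1.1\RepositoryServer\bin\pmrep" DeployDeploymentGroup -p DEPLOY_MARKETING -c deploy_control.xml -r INFAPROD

A script like this can then be scheduled through cron or the Windows scheduler to run during the overnight window.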
Recommendations

Informatica recommends using the following process when running in a three-tiered
environment with Development, Test, and Production servers.

Non-Versioned Repositories
For migrating from Development into Test, Informatica recommends using the Object
Copy method. This method gives you total granular control over the objects that are
being moved. It also ensures that the latest Development mappings can be moved over
manually as they are completed. For recommendations on performing this copy
procedure correctly, see the steps listed in the Object Copy section.
Versioned Repositories
For versioned repositories, Informatica recommends using the Deployment Groups
method for repository migration in a distributed repository environment. This method
provides the greatest flexibility in that you can promote any object from within a
Development repository (even across folders) into any destination repository. Also, by
using labels, dynamic deployment groups, and the enhanced pmrep command line
utility, the deployment group migration method can result in automated
migrations that are executed without manual intervention.
THIRD PARTY VERSIONING

Some organizations have standardized on a third party version control software
package. In these cases, PowerCenter's XML import/export functionality offers
integration with those software packages and provides a means to migrate objects. This
method is most useful in a distributed environment because objects can be exported
into an XML file from one repository and imported into the destination repository.
The XML Object Copy Process allows you to copy nearly all repository objects, including
sources, targets, reusable transformations, mappings, mapplets, workflows, worklets,
and tasks. Beginning with PowerCenter 7, the export/import functionality was
enhanced to allow the export/import of multiple objects to a single XML file. This can
significantly cut down on the work associated with object level XML import/export.
The following steps outline the process of exporting the objects from source repository
and importing them into the destination repository:
EXPORTING
1. From Designer or Workflow Manager, login to the source repository. Open the
folder and highlight the object to be exported.
2. Select Repository -> Export Objects
3. The system will prompt you to select a directory location on the local
workstation. Choose the directory to save the file. Using the default name for
the XML file is generally recommended.
4. Open Windows Explorer and go to the C:\Program Files\Informatica PowerCenter
7.x\Client directory. (This may vary depending on where you installed the client
tools.)
5. Find the powrmart.dtd file, make a copy of it, and paste the copy into the
directory where you saved the XML file.
6. Together, these files are now ready to be added to the version control software
IMPORTING
1. Open the Designer or Workflow Manager client tool and log in to the
destination repository. Open the folder where the object is to be imported.
2. Select Repository -> Import Objects.
3. The system will prompt you to select a directory location and file to import into
the repository.
4. The following screen will appear with the steps for importing the object.


Select the mapping and add it to the Objects to Import list.



Click Next, and then click the Import button. Since the shortcuts have been
added to the folder, the mapping will now point to the new shortcuts and their
parent folder.
It is important to note that the pmrep command line utility has been greatly enhanced
in PowerCenter 7 such that the activities associated with XML import/export can be
automated through pmrep.
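As an illustration only, a simple export script might look like the following, again modeled on the earlier backup example. ObjectExport is the pmrep command for XML export; the object name, object type, folder, and output-file options shown here are assumptions and should be confirmed against the pmrep command reference for your version.

exportmapping.bat
@echo off
REM Connect to the source repository (same connect syntax as the backup example)
"C:\Program Files\Informatica PowerCenter 7.1.1\RepositoryServer\bin\pmrep" connect -r INFATEST -n Administrator -x Adminpwd -h infarepserver -o 7001
REM Export a single mapping to XML; the mapping name and flags are hypothetical
"C:\Program Files\Informatica PowerCenter 7.1.1\RepositoryServer\bin\pmrep" ObjectExport -n m_load_customers -o mapping -f MARKETING_TEST -u c:\export\m_load_customers.xml

The exported XML file can then be checked into the version control tool or imported into the destination repository with the corresponding ObjectImport command.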


Data Cleansing
Challenge
Accuracy is one of the biggest obstacles to the success of many data warehousing
projects. If users discover data inconsistencies, they may lose faith in the entire
warehouse's data. However, it is not unusual to discover that as many as half the
records in a database contain some type of information that is incomplete, inconsistent,
or incorrect. The challenge is, therefore, to cleanse data online, at the point of entry
into the data warehouse or operational data store (ODS), to ensure that the
warehouse/ODS provides consistent and accurate data for business decision-making.
A significant portion of time in the development process should be set aside for setting
up the data quality assurance process and implementing whatever data cleansing is
needed. In a production environment, data quality reports should be generated after
each data warehouse implementation or when new source systems are added to the
integrated environment. There should also be provision for rolling back if data quality
testing indicates that the data is unacceptable.
Description
Informatica has several partners in the data-cleansing arena. Rapid implementation,
tight integration, and a fast learning curve are the key differentiators for picking the
right data-cleansing tool for your project.
Informatica's data-quality partners provide quick-start templates for standardizing,
address correcting, and matching records for best-build logic, which can be tuned to
your business rules. Matching and consolidating is the most crucial step to guarantee a
single view of a subject area (e.g., Customers) so everyone in the enterprise can make
better business decisions.
Concepts
Following is a list of steps to organize and implement a good data quality strategy.
These data quality concepts provide a foundation that helps to develop a clear picture
of the subject data, which can improve both efficiency and effectiveness.
Parsing - the process of extracting individual elements within the records, files, or
data entry forms to check the structure and content of each field. For example, name,
title, company name, phone number, and SSN.
Correction - the process of correcting data using sophisticated algorithms and
secondary data sources to check and validate information. An example is validating
addresses against postal directories.
Standardize - arranging information in a consistent manner and preferred format.
Examples include removal of dashes from phone numbers or SSNs.
Enhancement - adding useful, but optional, information to supplement existing data
or to complete data. Examples may include sales volume or number of employees for a
given business.
Matching - once a high-quality record exists, eliminate any redundancies. Use
match standards and specific business rules to identify records that may refer to the
same customer.
Consolidate - using the data found during matching to combine all of the similar data
into a single consolidated view. Examples are building a best record, a master record, or
house-holding.
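For example (a hypothetical pair of records): "J. Smith, 123 Main Str." and "John Smith, 123 Main Street" would each be parsed into name and address elements, corrected and standardized to the preferred address format, matched as the same customer under the business rules, and finally consolidated into a single best record.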
Partners
Following is a list of data quality partners and their respective tools:
DataMentors - Provides tools that are run before the data extraction and load process
to clean source data. Available tools are:
DMDataFuse™ - a data cleansing and house-holding system with the power to
accurately standardize and match data.
DMValiData™ - an effective data analysis system that profiles and identifies
inconsistencies between data and metadata.
DMUtils - a powerful non-compiled scripting language that operates on flat ASCII or
delimited files. It is primarily used as a query and reporting tool. Additionally, it
provides a way to reformat and summarize files.
FirstLogic - FirstLogic offers direct interfaces to PowerCenter during the extract and
load process, as well as providing pre-data extraction data cleansing tools like
DataRight, ACE (address correction and enhancement), and Match and Consolidate
(formerly Merge/Purge). The data cleansing interfaces are implemented as transformation
components, using the PowerCenter External Procedure or Advanced External Procedure calls. Thus,
these transformations can be dragged and dropped seamlessly into a PowerCenter
mapping for parsing, standardization, cleansing, enhancement, and matching of the
names, business, and address information during the PowerCenter ETL process of
building a data mart or data warehouse.
Paladyne - The flagship product, Datagration, is an open, flexible data quality system
that can repair any type of data (in addition to name and address) by incorporating
custom business rules and logic. Datagration's Data Discovery Message Gateway
feature assesses data cleansing requirements using automated data discovery tools
that identify data patterns. Data Discovery enables Datagration to search through a
field of free form data and re-arrange the tokens (i.e., words, data elements) into a
logical order. Datagration supports relational database systems and flat files as data
sources and any application that runs in batch mode.
Trillium - Trillium's eQuality customer information components (a web enabled tool)
are integrated with the PowerCenter Transformation Exchange modules and reside on
the same server as the PowerCenter transformation engine. As a result, Informatica
users can invoke Trillium's four data quality components through an easy-to-use
graphical desktop object. The four components are:
Converter: data analysis and investigation module for discovering word patterns
and phrases within free-form text.
Parser: processing engine for data cleansing, elementizing, and standardizing
customer data.
Geocoder: an internationally-certified postal and census module for address
verification and standardization.
Matcher: a module designed for relationship matching and record linking.
Innovative Systems - The i/Lytics solution operates within PowerMart and
PowerCenter version 6.x to provide smooth, seamless project flow. Using its unique
knowledgebase of more than three million words and word patterns, i/Lytics cleanses,
standardizes, links, and households customer data to create a complete and accurate
customer profile each time a record is added or updated.
iORMYX International Inc. - iORMYX's df Informatica Adapter allows you to use
Dataflux data quality capabilities directly within PowerCenter and PowerMart using
Advanced External Procedures. Within PowerCenter, you can easily drag Dataflux
transformations into the workflows. Additionally, by utilizing the data profiling
capabilities of Dataflux, you can design ETL workflows that are successful from the start
and build targeted, data-specific business rules that enhance data quality from within
PowerCenter. The integrated solution significantly improves accuracy and effectiveness
of business intelligence and enterprise systems by providing standardized and accurate
data.
Integration Examples
The following sections describe how to integrate two of the tools with PowerCenter.
FirstLogic - ACE
The following graphic illustrates a high level flow diagram of the data cleansing process.


Use the Informatica Advanced External Transformation process to interface with the
FirstLogic module by creating a "Matching Link" transformation. That process uses the
Informatica Transformation Developer to create a new Advanced External
Transformation, which incorporates the properties of the FirstLogic Matching Link files.
Once a Matching Link transformation has been created in the Transformation
Developer, users can incorporate that transformation into any of their project
mappings; it's reusable from the repository.
When a PowerCenter session starts, the transformation is initialized. The initialization
sets up the address processing options, allocates memory, and opens the files for
processing. This operation is only performed once. As each record is passed into the
transformation, it is parsed and standardized. Any output components are created and
passed to the next transformation. When the session ends, the transformation is
terminated. The memory is once again available and the directory files are closed.
The available functions / processes are as follows.
ACE Processing
There are four ACE transformations to choose from. Three base transformations parse,
standardize, and append address components using FirstLogic's ACE Library. The
transformation choice depends on the input record layout. The fourth transformation
can provide optional components. This transformation must be attached to one of the
three base transformations.
The four transformations are:
1. ACE_discrete - where the input address data is presented in discrete fields.
2. ACE_multiline - where the input address data is presented in multiple lines (1-
6).
3. ACE_mixed - where the input data is presented with discrete city/state/zip and
multiple address lines (1-6).
4. Optional transform - which is attached to one of the three base transforms
and outputs the additional components of ACE for enhancement.
All records input into the ACE transformation are returned as output. ACE returns
Error/Status Code information during the processing of each address. This allows the
end user to invoke additional rules before the final load has completed.
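For example, a Router or Filter transformation placed after the ACE transformation can use these codes to divert suspect addresses for review before the final load. The group condition below is a sketch only; the port name ACE_STATUS_CODE and the value '0' are hypothetical and should be replaced with the actual output port and status values documented by FirstLogic:

-- route addresses that did not standardize cleanly to an error target
NOT ISNULL(ACE_STATUS_CODE) AND ACE_STATUS_CODE <> '0'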
TrueName Process
TrueName mirrors the ACE base transformations with discrete, multi-line, and mixed
transformations. A fourth and optional transformation available in this process can be
attached to one of the three base transformations to provide genderization and match
standards enhancements. TrueName generates error and status codes. Similar to ACE,
all records entered as input into the TrueName transformation can be used as output.
Matching Process
The matching process works through one transformation within the Informatica
architecture. The input data is read into the PowerCenter data flow similar to a batch

file. All records are read, the break groups created and, in the last step, matches are
identified. Users set up their own matching transformation through the PowerCenter
Designer by creating an advanced external procedure transformation. Users can select
which records are outputs from the matching transformations by editing the
initialization properties of the transformation.
All matching routines are predefined and, if necessary, the configuration files can be
accessed for additional tuning. The five predefined matching scenarios include:
individual, family, household (the only difference between household and family is that
household doesn't match on last name), firm individual, and firm. Keep in mind that the
matching does not do any data parsing; this must be accomplished prior to using this
transformation. As with ACE and TrueName, error and status codes are reported.
Trillium
Integration to Trillium's data cleansing software is achieved through the Informatica
Trillium Advanced External Procedures (AEP) interface.
The AEP modules incorporate the following Trillium functional components.
Trillium Converter - The Trillium Converter facilitates data conversion such as
EBCDIC to ASCII, integer to character, character length modification, literal
constant, and increasing values. It can also be used to create unique record
identifiers, omit unwanted punctuation, or translate strings based on actual data
or mask values. A user-customizable parameter file drives the conversion
process. The Trillium Converter is a separate transformation that can be used
standalone or in conjunction with the Trillium Parser module.
Trillium Parser - The Trillium Parser identifies and/or verifies the components of
free-floating or fixed field name and address data. The primary function of the
Parser is to partition the input address records into manageable components in
preparation for postal and census geocoding. The parsing process is highly table
driven to allow for customization of name and address identification to specific
requirements.
Trillium Postal Geocoder - The Trillium Postal Geocoder matches an address
database to the ZIP+4 database of the U.S. Postal Service (USPS).
Trillium Census Geocoder - The Trillium Census Geocoder matches the address
database to U.S. Census Bureau information.
Each record that passes through the Trillium Parser external module is first parsed
then, optionally postal geocoded and census geocoded. The level of geocoding
performed is determined by a user-definable initialization property.
Trillium Window Matcher - The Trillium Window Matcher allows the PowerCenter
Server to invoke Trillium's de-duplication and householding functionality. The
Window Matcher is a flexible tool designed to compare records to determine the
level of similarity between them. The result of the comparisons is considered a
passed, a suspect, or a failed match depending upon the likeness of data
elements in each record, as well as a scoring of their exceptions.
Input to the Trillium Window Matcher transformation is typically the sorted output of
the Trillium Parser transformation. Another method to obtain sorted information is to

use the sorter transformation, which became available in the PowerCenter 6.0
release. Other options for sorting include:
Using the Informatica Aggregator transformation as a sort engine.
Separate the mappings whenever a sort is required. The sort can be run as a
pre/post-session command between mappings (see the sample command after this
list). Pre/post-session commands are configured in the Workflow Manager.
Build a custom AEP Transformation to include in the mapping.
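As a sketch of the pre/post-session command option above, a pre-session command on UNIX might sort the parsed output file before the matching mapping reads it; the file path, delimiter, and key position below are placeholders:

sort -t ',' -k 2,2 /data/stage/parsed_names.dat -o /data/stage/parsed_names_sorted.dat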


Data Connectivity using PowerCenter Connect for BW
Integration Server
Challenge
Understanding how to use PowerCenter Connect for SAP BW to extract data from and load
data into the SAP BW (Business Information Warehouse).
Description
The PowerCenter Connect for SAP BW supports the SAP Business Information
Warehouse as both a source and target.
Extracting Data from BW
PowerCenter Connect for SAP BW lets you extract data from SAP BW to use as a source
in a PowerCenter session. PowerCenter Connect for SAP BW integrates with the Open
Hub Service (OHS), SAP's framework for extracting data from BW. OHS uses data from
multiple BW data sources, including SAP's InfoSources and InfoCubes. The OHS
framework includes InfoSpoke programs, which extract data from BW and write the
output to SAP transparent tables.
Loading Data into BW
PowerCenter Connect for SAP BW lets you import BW target definitions into the
Designer and use the target in a mapping to load data into BW. PowerCenter Connect
for SAP BW uses the Business Application Programming Interface (BAPI) to exchange
metadata and load data into BW.
PowerCenter can use SAP's business content framework to provide a high-volume data
warehousing solution or SAP's Business Application Programming Interface (BAPI), SAP's
strategic technology for linking components into the Business Framework, to exchange
metadata with BW.
PowerCenter extracts and transforms data from multiple sources and uses SAP's high-
speed bulk BAPIs to load the data into BW, where it is integrated with industry-specific
models for analysis through the SAP Business Explorer tool.
Using PowerCenter with PowerCenter Connect to Populate BW

The following paragraphs summarize some of the key differences in using PowerCenter
with the PowerCenter Connect to populate a SAP BW rather than working with standard
RDBMS sources and targets.
BW uses a pull model. The BW must request data from a source system before the
source system can send data to the BW. PowerCenter must first register with
the BW using SAP's Remote Function Call (RFC) protocol.
The native interface to communicate with BW is the Staging BAPI, an API
published and supported by SAP. Three components of the PowerCenter product suite use
this API. The PowerCenter Designer uses the Staging BAPI to import metadata
for the target transfer structures. The PowerCenter Integration Server for BW
uses the Staging BAPI to register with BW and receive requests to run sessions.
The PowerCenter Server uses the Staging BAPI to perform metadata verification
and load data into BW.
Programs communicating with BW use the SAP standard saprfc.ini file to
communicate with BW. The saprfc.ini file is similar to the tnsnames file in Oracle
or the interface file in Sybase. The PowerCenter Designer reads metadata from
BW and the PowerCenter Server writes data to BW.
BW requires that all metadata extensions be defined in the BW Administrator
Workbench. The definition must be imported to Designer. An active structure is
the target for PowerCenter mappings loading BW.
Because of the pull model, BW must control all scheduling. BW invokes the
PowerCenter session when the InfoPackage is scheduled to run in BW.
BW only supports insertion of data into BW. There is no concept of update or
deletes through the staging BAPI.

Steps for Extracting Data from BW
The process of extracting data from SAP BW is quite similar to extracting data from
SAP. Similar transports are used on the SAP side, and data type support is the same as
that supported for SAP PowerCenter Connect.
The steps required for extracting data are:
1. Create an InfoSpoke. Create an InfoSpoke in the BW to extract the data from
the BW database and write it to either a database table or a file output target.
2. Import the ABAP program. Import the ABAP program Informatica provides
that calls the workflow created in the Workflow Manager.
3. Create a mapping. Create a mapping in the Designer that uses the database
table or file output target as a source.
4. Create a workflow to extract data from BW. Create a workflow and session
task to automate data extraction from BW.
5. Create a Process Chain. A BW Process Chain links programs together to run in
sequence. Create a Process Chain to link the InfoSpoke and ABAP programs
together.
6. Schedule the data extraction from BW. Set up a schedule in BW to automate
data extraction.
Steps To Load Data into BW
1. Install and Configure PowerCenter Components.

The installation of the PowerCenter Connect for BW includes a client and a
server component. The Connect server must be installed in the same directory
as the PowerCenter Server. Informatica recommends installing Connect client
tools in the same directory as the PowerCenter Client. For more details on
installation and configuration refer to the PowerCenter and the PowerCenter
Connect installation guides.
2. Build the BW Components.
To load data into BW, you must build components in both BW and PowerCenter.
You must first build the BW components in the Administrator Workbench:
Define PowerCenter as a source system to BW. BW requires an external source
definition for all non-R/3 sources.
The InfoSource represents a provider structure. Create the InfoSource in the BW
Administrator Workbench and import the definition into the PowerCenter
Warehouse Designer.
Assign the InfoSource to the PowerCenter source system. After you create an
InfoSource, assign it to the PowerCenter source system.
Activate the InfoSource. When you activate the InfoSource, you activate the
InfoObjects and the transfer rules.
3. Configure the saprfc.ini file.
This file is required for PowerCenter and PowerCenter Connect to connect to BW.
PowerCenter uses two types of entries to connect to BW through the saprfc.ini
file:
Type A. Used by the PowerCenter Client and PowerCenter Server. Specifies the BW
application server.
Type R. Used by the PowerCenter Connect for BW. Specifies the external program,
which is registered at the SAP gateway.
Sample entries for both types are shown after these steps. Do not use Notepad to
edit the saprfc.ini file because Notepad can corrupt the file. Set the RFC_INI
environment variable on all Windows NT, Windows 2000, and Windows 95/98 machines
that use the saprfc.ini file. RFC_INI is used to locate the saprfc.ini file.
4. Start the Connect for BW server
Start the Connect for BW server only after you start the PowerCenter Server and
before you create the InfoPackage in BW.
5. Build mappings
Import the InfoSource into the PowerCenter repository and build a mapping
using the InfoSource as a target.
The following restrictions apply to building mappings with BW InfoSource target:

You cannot use BW as a lookup table.
You can use only one transfer structure for each mapping.
You cannot execute stored procedures in a BW target.
You cannot partition pipelines with a BW target.
You cannot copy fields that are prefaced with /BIC/ from the InfoSource definition
into other transformations.
You cannot build an update strategy in a mapping. BW supports only inserts; it
does not support updates or deletes. You can use an Update Strategy
transformation in a mapping, but the Connect for BW Server attempts to insert
all records, even those marked for update or delete.
6. Load data
To load data into BW from PowerCenter, both PowerCenter and the BW system
must be configured.
Use the following steps to load data into BW:
Configure a workflow to load data into BW. Create a session in a workflow that
uses a mapping with an InfoSource target definition.
Create and schedule an InfoPackage. The InfoPackage associates the PowerCenter
session with the InfoSource.
When the Connect for BW Server starts, it communicates with the BW to register
itself as a server. The Connect for BW Server waits for a request from the BW to
start the workflow. When the InfoPackage starts, the BW communicates with the
registered Connect for BW Server and sends the workflow name to be scheduled
with the PowerCenter Server. The Connect for BW Server reads information
about the workflow and sends a request to the PowerCenter Server to run the
workflow.
The PowerCenter Server validates the workflow name in the repository and the
workflow name in the InfoPackage. The PowerCenter Server executes the
session and loads the data into BW. You must start the Connect for BW Server
after you restart the PowerCenter Server.
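The following saprfc.ini sketch illustrates the two entry types described in step 3. The destination names, host name, system number, and program ID are placeholders; the Type R PROGID must match the program ID registered at the SAP gateway for the Connect for BW server.

Type A entry (used by the PowerCenter Client and PowerCenter Server):
DEST=BWSOURCE
TYPE=A
ASHOST=bwhost01
SYSNR=00

Type R entry (used by the PowerCenter Connect for BW server):
DEST=BWCONNECT
TYPE=R
PROGID=INFORMATICA_BW
GWHOST=bwhost01
GWSERV=sapgw00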
Supported Datatypes
The PowerCenter Server transforms data based on the Informatica transformation
datatypes. BW can only receive data in packets of 250 bytes. The PowerCenter Server
converts all data to a CHAR datatype and puts it into packets of 250 bytes, plus one
byte for a continuation flag.
BW receives data until it reads the continuation flag set to zero. Within the transfer
structure, BW then converts the data to the BW datatype. Currently, BW only supports
the following datatypes in transfer structures assigned to BAPI source systems
(PowerCenter): CHAR, CUKY, CURR, DATS, NUMC, TIMS, UNIT
All other datatypes result in the following error in BW:
Invalid data type (data type name) for source system of type BAPI.

Date/Time Datatypes
The transformation date/time datatype supports dates with precision to the second. If
you import a date/time value that includes milliseconds, the PowerCenter Server
truncates to seconds. If you write a date/time value to a target column that supports
milliseconds, the PowerCenter Server inserts zeros for the millisecond portion of the
date.
Binary Datatypes
BW does not allow you to build a transfer structure with binary datatypes. Therefore,
you cannot load binary data from PowerCenter into BW.
Numeric Datatypes
PowerCenter does not support the INT1 datatype.
Performance Enhancement for Loading into SAP BW
If you see a performance slowdown for sessions that load into SAP BW, set the default
buffer block size to 15-20MB to enhance performance. You can put 5,000-10,000 rows
per block, so you can calculate the buffer block size needed with the following formula:
Row size x Rows per block = Default Buffer Block size
For example, if your target row size is 2 KB: 2 KB x 10,000 = 20 MB.


Data Connectivity using PowerCenter Connect for
MQSeries
Challenge
Understanding how to use MQSeries Applications in PowerCenter mappings.
Description
MQSeries Applications communicate by sending each other messages rather than by
calling each other directly. Applications can also request data using a "request
message" on a message queue. Because no open connections are needed between
systems, they can run independently of one another. MQSeries enforces no structure
on the content or format of the message; this is defined by the application.
The following features and functions are not available to PowerCenter when using
MQSeries:
Lookup transformations can be used in an MQSeries mapping, but lookups on
MQSeries sources are not allowed.
Certain considerations are necessary when using AEPs, aggregators, custom
transformations, joiners, sorters, rank, or transaction control transformations
because they can only be performed on one queue, as opposed to a full data set.

MQSeries Architecture
IBM MQSeries is a messaging and queuing application that permits programs to
communicate with one another across heterogeneous platforms and network protocols
using a consistent application-programming interface.
PowerCenter Connect for MQSeries architecture has three parts:
Queue Manager, which provides administrative functions for queues and
messages.
Message Queue, which is a destination to which messages can be sent.
MQSeries Message, which incorporates a header and a data component.
Queue Manager

Informatica connects to Queue Manager to send and receive messages.
Every message queue belongs to a Queue Manager.
Queue Manager administers queues, creates queues, and controls queue
operation.
MQSeries Message
MQSeries header contains data about the queue. Message header data includes
the message identification number, message format, and other message
descriptor data. In PowerCenterRT, MQSeries sources and dynamic MQSeries
targets automatically incorporate MQSeries message header fields.
MQSeries data component contains the application data or the "message body."
The content and format of the message data is defined by the application that
uses the message queue.
Extraction from a Queue
In order for PowerCenter to extract from a queue, the message must be in COBOL, XML,
flat file, or binary format. When extracting from a queue, you need to use either
of two source qualifiers: MQ Source Qualifier (MQ SQ) or Associated Source Qualifier
(SQ).
You must use MQ SQ to read data from an MQ source, but you cannot use MQ SQ to
join two MQ sources. MQ SQ is predefined and comes with 29 message header fields.
MSGID is the primary key. After extracting from a queue, you can use a Midstream XML
Parser transformation to parse XML in a pipeline.
MQ SQ can perform the following tasks:
Select Associated Source Qualifier - this is necessary if the file is not binary.
Set Tracing Level - verbose, normal, etc.
Set Message Data Size - default 64,000; used for binary.
Filter Data - set filter conditions to filter messages using message header ports,
control end of file, control incremental extraction, and control syncpoint queue
clean up.
Use mapping parameters and variables
In addition, you can enable message recovery for sessions that fail when reading
messages from an MQSeries source, as well as use the Destructive Read attribute to
both remove messages from the source queue at synchronization points and evaluate
filter conditions when enabling message recovery.
If the data is not in binary, either an Associated SQ (XML, flat file) or a Normalizer
(COBOL) is required. If you use an Associated SQ, be sure to design the
mapping as if it were not using MQ Series, and then add the MQ Source and Source
Qualifier after testing the mapping logic, joining them to the associated source qualifier.
When the code is working correctly, test by actually pulling data from the queue.
Loading to a Queue

Two types of MQ Targets can be used in a mapping: Static MQ Targets and Dynamic MQ
Targets. However, you can use only one type of MQ Target in a single mapping. You
can also use a Midstream XML Generator transformation to create XML inside a pipeline.
Static MQ Targets - Used for loading message data (instead of header data) to the
target. A static target does not load data to the message header fields. Use the
target definition specific to the format of the message data (i.e., flat file, XML,
COBOL). Design the mapping as if it were not using MQ Series, then configure
the target connection to point to a MQ message queue in the session when using
MQSeries.
Dynamic - Used for binary targets only, and when loading data to a message
header. Note that certain message headers in an MQSeries message require a
predefined set of values assigned by IBM.
Creating and Configuring MQSeries Sessions
After you create mappings in the Designer, you can create and configure sessions in the
Workflow Manager.
Configuring MQSeries Sources
The MQSeries source definition represents the metadata for the MQSeries source in the
repository. Unlike other source definitions, you do not create an MQSeries source
definition by importing the metadata from the MQSeries source. Since all MQSeries
messages contain the same message header and message data fields, the Designer
provides an MQSeries source definition with predefined column names.
MQSeries Mappings
MQSeries mappings cannot be partitioned if an associated source qualifier is used.
For MQ Series sources, set the Source Type to the following:
Heterogeneous when there is an associated source definition in the mapping. This
indicates that the source data is coming from an MQ source, and the message
data is in flat file, COBOL or XML format.
Message Queue when there is no associated source definition in the mapping.
Note that there are two pages on the Source Options dialog: XML and MQSeries. You
can alternate between the two pages to set configurations for each.
Configuring MQSeries Targets
For Static MQSeries Targets, select File Target type from the list. When the target is an
XML file or XML message data for a target message queue, the target type is
automatically set to XML.
1. If you load data to a dynamic MQ target, the target type is automatically set to
Message Queue.
2. On the MQSeries page, select the MQ connection to use for the source message
queue, and click OK.

3. Be sure to select the MQ checkbox in Target Options for the Associated file type.
Then click Edit Object Properties and type:
o the connection name of the target message queue.
o the format of the message data in the target queue (ex. MQSTR).
o the number of rows per message (only applies to flat file MQ targets).

TIP
Sessions can be run in real-time by using the ForcedEOQ(n) function (or
similar functions like Idle(n) and FlushLatency(n)) in a filter condition and
configuring the workflow to run continuously.
When the ForcedEOQ(n) function is used, the PowerCenter Server stops
reading messages from the source at the end of the ForcedEOQ(n) period;
because it is set to run continuously, the session will automatically be
restarted. If the session needs to be run without stopping, then use the
following filter condition:
Idle(100000) && FlushLatency(3)
Appendix Information
PowerCenter uses the following datatypes in MQSeries mappings:
IBM MQSeries datatypes. IBM MQSeries datatypes appear in the MQSeries
source and target definitions in a mapping.
Native datatypes. Flat file, XML, or COBOL datatypes associated with an
MQSeries message data. Native datatypes appear in flat file, XML and COBOL
source definitions. Native datatypes also appear in flat file and XML target
definitions in the mapping.
Transformation datatypes. Transformation datatypes are generic datatypes that
PowerCenter uses during the transformation process. They appear in all the
transformations in the mapping.
IBM MQSeries Datatypes
MQSeries Datatype     Transformation Datatype
MQBYTE                BINARY
MQCHAR                STRING
MQLONG                INTEGER
MQHEX

Values for Message Header Fields in MQSeries Target Messages
MQSeries Message Header     Description

StrucId Structure identifier
Version Structure version number
Report Options for report messages
MsgType Message type
Expiry Message lifetime
Feedback Feedback or reason code
Encoding Data encoding
CodedCharSetId Coded character set identifier
Format Format name
Priority Message priority
Persistence Message persistence
MsgId Message identifier
CorrelId Correlation identifier
BackoutCount Backout counter
ReplyToQ Name of reply queue
ReplyToQMgr Name of reply queue manager
UserIdentifier Defined by the environment. If the
MQSeries server cannot determine this
value, the value for the field is null.
AccountingToken Defined by the environment. If the
MQSeries server cannot determine this
value, the value for the field is
MQACT_NONE.
ApplIdentityData Application data relating to identity.
The value for ApplIdentityData is null.
PutApplType Type of application that put the
message on queue. Defined by the
environment.
PutApplName Name of application that put the
message on queue. Defined by the
environment. If the MQSeries server
cannot determine this value, the value
for the field is null.
PutDate Date when the message arrives in the
queue.
PutTime Time when the message arrives in
queue.
ApplOriginData Application data relating to origin.
The value for ApplOriginData is null.
GroupId Group identifier
MsgSeqNumber Sequence number of logical messages
within group.
Offset Offset of data in physical message
from start of logical message.
MsgFlags Message flags
OriginalLength Length of original message


Data Connectivity using PowerCenter Connect for SAP
Challenge
Understanding how to install PowerCenter Connect for SAP R/3, extract data from SAP
R/3, build mappings, run sessions to load SAP R/3 data and load data to SAP R/3.
Description
SAP R/3 is a software system that integrates multiple business applications, such as
financial accounting, materials management, sales and distribution, and human
resources. The R/3 system is programmed in Advanced Business Application
Programming-Fourth Generation (ABAP/4, or ABAP), a language proprietary to SAP.
PowerCenter Connect for SAP R/3 provides the ability to integrate SAP R/3 data into
data warehouses, analytic applications, and other applications. All of this is
accomplished without writing complex ABAP code. PowerCenter Connect for SAP R/3
generates ABAP programs on the SAP R/3 server. PowerCenter Connect for SAP R/3
extracts data from transparent tables, pool tables, cluster tables, hierarchies (Uniform
& Non Uniform), SAP IDocs and ABAP function modules.
When integrated with R/3 using ALE (Application Link Enabling), PowerCenter Connect
for SAP R/3 can also extract data from R/3 using outbound IDocs (Intermediate
Documents) in real time. The ALE concept available in R/3 Release 3.0 supports the
construction and operation of distributed applications. It incorporates the controlled
exchange of business data messages while ensuring data consistency across loosely
coupled SAP applications. The integration of various applications is achieved by using
synchronous and asynchronous communication, rather than by means of a central
database. PowerCenter Connect for SAP R/3 can change data in R/3, as well as load
new data into R/3 using direct RFC/BAPI function calls. It can also load data into SAP
R/3 using inbound IDocs.
The database server stores the physical tables in the R/3 system, while the application
server stores the logical tables. A transparent table definition on the application server
is represented by a single physical table on the database server. Pool and cluster tables
are logical definitions on the application server that do not have a one-to-one
relationship with a physical table on the database server.
Communication Interfaces

TCP/IP is the native communication interface between PowerCenter and SAP R/3. Other
interfaces between the two include:
Common Program Interface-Communications (CPI-C). CPI-C communication
protocol enables online data exchange and data conversion between R/3 and
PowerCenter. To initialize CPI-C communication with PowerCenter, SAP R/3 requires
information such as the host name of the application server and the SAP gateway. This
information is stored on the PowerCenter Server in a configuration file named sideinfo.
The PowerCenter Server uses parameters in the sideinfo file to connect to the R/3
system when running stream mode sessions.
Remote Function Call (RFC). RFC is the remote communication protocol used by SAP
and is based on RPC (Remote Procedure Call). To execute remote calls from
PowerCenter, SAP R/3 requires information such as the connection type and the service
name and gateway on the application server. This information is stored on the
PowerCenter Client and PowerCenter Server in a configuration file named saprfc.ini.
PowerCenter makes remote function calls when importing source definitions, installing
ABAP programs, and running file mode sessions.
Transport system. The transport system in SAP is a mechanism to transfer objects
developed on one system to another system. There are two situations when the
transport system is needed:
PowerCenter Connect for SAP R/3 installation.
Transport ABAP programs from development to production.
Note: if the ABAP programs are installed in the $TMP class, they cannot be transported
from development to production.
Security. You must have proper authorizations on the R/3 system to perform
integration tasks. The R/3 administrator needs to create authorizations, profiles, and
users for PowerCenter users.
Integration Feature                     Authorization Object   Activity
Import Definitions, Install Programs    S_DEVELOP              All activities. Also need to set Development Object ID to PROG.
Extract Data                            S_TABU_DIS             READ
Run File Mode Sessions                  S_DATASET              WRITE
Submit Background Job                   S_PROGRAM              BTCSUBMIT, SUBMIT
Release Background Job                  S_BTCH_JOB             DELE, LIST, PLAN, SHOW. Also need to set Job Operation to RELE.
Run Stream Mode Sessions                S_CPIC                 All activities
Authorize RFC privileges                S_RFC                  All activities

You also need access to the SAP GUI, as described in the following SAP GUI Parameters
table:
Parameter        Variable                  Comments
User ID          $SAP_USERID               Identifies the username that connects to the SAP GUI and is
                                           authorized for read-only access to the following transactions:
                                           SE12, SE15, SE16, SPRO
Password         $SAP_PASSWORD             Identifies the password for the above user
System Number    $SAP_SYSTEM_NUMBER        Identifies the SAP system number
Client Number    $SAP_CLIENT_NUMBER        Identifies the SAP client number
Server           $SAP_SERVER               Identifies the server on which this instance of SAP is running
Key Capabilities of PowerCenter Connect for SAP R/3
Some key capabilities of PowerCenter Connect for SAP R/3 include:
Extract data from R/3 systems using ABAP, SAP's proprietary 4GL.
Extract data from R/3 using outbound IDocs or write data to R/3 using
inbound IDocs through integration with R/3 using ALE. You can extract
data from R/3 using outbound IDocs in real time.
Extract data from R/3 and load new data into R/3 using direct RFC/BAPI
function calls.
Migrate data from any source into R/3. You can migrate data from legacy
applications, other ERP systems, or any number of other sources into SAP R/3.
Extract data from R/3 and write it to a target data warehouse. PowerCenter
Connect for SAP R/3 can interface directly with SAP to extract internal data from
SAP R/3 and write it to a target data warehouse. You can then use the data
warehouse to meet mission critical analysis and reporting needs.
Support for calling BAPI as well as RFC functions dynamically from PowerCenter for
data integration. PowerCenter Connect for SAP R/3 can make BAPI as well as
RFC function calls dynamically from mappings to extract data from an R/3
source, transform data in the R/3 system, or load data into an R/3 system.
Support for data integration using ALE. PowerCenter Connect for SAP R/3 can
capture changes to the master and transactional data in SAP R/3 using ALE.
PowerCenter Connect for SAP R/3 can receive outbound IDocs from SAP R/3 in
real time and load into SAP R/3 using inbound IDocs. To receive IDocs in real
time using ALE, install PowerCenter Connect for SAP R/3 on PowerCenterRT.
Analytic Business Components for SAP R/3 (ABC). ABC is a set of business content
that enables rapid and easy development of the data warehouse based on R/3

data. ABC business content includes mappings, mapplets, source objects,
targets, and transformations.
Metadata Exchange. PowerCenter Connect for SAP R/3 Metadata Exchange extracts
metadata from leading data modeling tools and imports it into PowerCenter
repositories through MX SDK.
Import SAP functions in the Source Analyzer.
Import IDocs. PowerCenter Connect for SAP R/3 can create a transformation to
process outbound IDocs and generate inbound IDocs. PowerCenter Connect for
SAP R/3 can edit the transformation to modify the IDoc segments you want to
include. PowerCenter Connect for SAP R/3 can reorder and validate inbound
IDocs before writing them to the SAP R/3 system. PowerCenter Connect for SAP
R/3 can set partition points in a pipeline for outbound and inbound IDoc
sessions and sessions that fail when reading outbound IDocs from an SAP R/3
source can be configured for recovery. You can also receive data from outbound
IDoc files and write data to inbound IDoc files.
Insert ABAP Code Block to add more functionality to the ABAP program flow.
Use of outer join when two or more sources are joined in the ERP source qualifier.
Use of static filters to reduce return rows. (e.g. MARA = MARA-MATNR = 189)
Customization of the ABAP program flow with joins, filters, SAP functions, and
code blocks. For example: qualifying table = table1-field1 = table2-field2 where
the qualifying table is the last table in the condition based on the join order.
Creation of ABAP program variables to represent SAP R/3 structures, structure
fields, or values in the ABAP program
Removal of ABAP program information from SAP R/3 and the repository when a
folder is deleted.
Enhanced Platform support. PowerCenter Connect for SAP R/3 can be run on 64-
bit AIX and HP-UX (Itanium). You can also install PowerCenter Connect for SAP R/3
for the PowerCenter Server and Repository Server on SuSE Linux and Red Hat Linux.
PowerCenter Connect for SAP R/3 can be connected with SAP's business content
framework to provide a high-volume data warehousing solution.

Installation and Configuration Steps
PowerCenter Connect for SAP R/3 setup programs install components for PowerCenter
Server, Client, and repository server. These programs install drivers, connection files,
and a repository plug-in XML file that enables integration between PowerCenter and
SAP R/3. Setup programs can also install PowerCenter Connect for SAP R/3 Analytic
Business Components, and PowerCenter Connect for SAP R/3 Metadata Exchange.
The PowerCenter Connect for SAP R/3 repository plug-in is called sapplg.xml. After the
plug-in is installed, it needs to be registered in the PowerCenter repository.
For SAP R/3
Informatica provides a group of customized objects required for R/3 integration. These
objects include tables, programs, structures, and functions that PowerCenter Connect
for SAP R/3 exports to data files. The R/3 system administrator must use the transport
control program, tp import, to transport these object files to the R/3 system. The
transport process creates a development class called ZERP. The SAPTRANS directory

contains data and co files. The data files are the actual transport objects. The co
files are control files containing information about the transport request.
The R/3 system needs development objects and user profiles established to
communicate with PowerCenter. Preparing R/3 for integration involves the following
tasks:
Transport the development objects on the PowerCenter CD to R/3. PowerCenter
calls these objects each time it makes a request to the R/3 system.
Run the transport program that generates unique IDs.
Establish profiles in the R/3 system for PowerCenter users.
Create a development class for the ABAP programs that PowerCenter installs on
the SAP R/3 system.
For PowerCenter
The PowerCenter server and client need drivers and connection files to communicate
with SAP R/3. Preparing PowerCenter for integration involves the following tasks:
Run installation programs on PowerCenter Server and Client machines.
Configure the connection files:
o The sideinfo file on the PowerCenter Server allows PowerCenter to initiate
CPI-C with the R/3 system. Following are the required parameters for
sideinfo:
DEST logical name of the R/3 system
LU host name of the SAP R/3 application server machine
TP set to sapdp<system number>
GWHOST host name of the SAP gateway machine
GWSERV set to sapgw<system number>
PROTOCOL set to I for TCP/IP connection
o The saprfc.ini file on the PowerCenter Client and Server allows PowerCenter
to connect to the R/3 system as an RFC client. The required parameters
for saprfc.ini are:
DEST logical name of the R/3 system
TYPE set to A to indicate connection to a specific R/3 system
ASHOST host name of the SAP R/3 application server
SYSNR system number of the SAP R/3 application server
(Sample entries for both files appear below.)
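The sample entries below are a sketch only; the destination name, host names, and system number are placeholders that must match your R/3 landscape.

sideinfo (stream mode, CPI-C):
DEST=SAPR3DEV
LU=sapr3host01
TP=sapdp00
GWHOST=sapr3host01
GWSERV=sapgw00
PROTOCOL=I

saprfc.ini (RFC, Type A):
DEST=SAPR3DEV
TYPE=A
ASHOST=sapr3host01
SYSNR=00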
Following is the summary of required steps:

1. Install PowerCenter Connect for SAP R/3 on PowerCenter.
2. Configure the sideinfo file.
3. Configure the saprfc.ini file.
4. Set the RFC_INI environment variable.
5. Configure an application connection for SAP R/3 sources in the Workflow
Manager.
6. Configure SAP/ALE IDoc connection in the Workflow Manager to receive IDocs
generated by the SAP R/3 system.
7. Configure the FTP connection to access staging files through FTP.
8. Install the repository plug-in in the PowerCenter repository.
Configuring the Services File
Windows
If SAPGUI is not installed, you must make entries in the Services file to run stream
mode sessions. This is found in the \WINNT\SYSTEM32\drivers\etc directory. Entries
are made similar to the following:
sapdp<system number> <port number of dispatcher service>/tcp
sapgw<system number> <port number of gateway service>/tcp
SAPGUI is not technically required, but experience has shown that evaluators typically
want to log into the R/3 system to use the ABAP workbench and to view table contents.
Unix
Services file is located in /etc
sapdp<system number> <port# of dispatcher service>/TCP
sapgw<system number> <port# of gateway service>/TCP
The system number and port numbers are provided by the BASIS administrator.
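For example, assuming system number 00 and the default SAP port numbering, the entries might look like the following; always confirm the actual port numbers with the BASIS administrator:

sapdp00 3200/tcp
sapgw00 3300/tcp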
Configure Connections to Run Sessions
Informatica supports two methods of communication between the SAP R/3 system and
the PowerCenter Server.
Stream Mode does not create any intermediate files on the R/3 system. This
method is faster, but it does use more CPU cycles on the R/3 system.
File Mode creates an intermediate file on the SAP R/3 system, which is then
transferred to the machine running the PowerCenter Server.
If you want to run file mode sessions, you must provide either FTP access or NFS
access from the machine running the PowerCenter Server to the machine running SAP
R/3. This, of course, assumes that PowerCenter and SAP R/3 are not running on the
same machine; it is possible to run PowerCenter and R/3 on the same system, but
highly unlikely.

If you want to use File mode sessions and your R/3 system is on a UNIX system, you
need to do one of the following:
Provide the login and password for the UNIX account used to run the SAP R/3
system.
Provide a login and password for a UNIX account belonging to same group as the
UNIX account used to run the SAP R/3 system.
Create a directory on the machine running SAP R/3, and run chmod g+s on that
directory. Provide the login and password for the account used to create this
directory.
Configure database connections in the Workflow Manager to access the SAP R/3 system
when running a session, then configure an FTP connection to access staging files through
FTP.
Extraction Process
R/3 source definitions can be imported from the logical tables using RFC protocol.
Extracting data from R/3 is a four-step process:
Import source definitions. The PowerCenter Designer connects to the R/3 application
server using RFC. The Designer calls a function in the R/3 system to import source
definitions.
Note: If you plan to join two or more tables in SAP, be sure you have optimized join
conditions. Make sure you have identified your driving table (e.g., if you plan to extract
data from the BKPF and BSEG accounting tables, be sure to drive your extracts from the
BKPF table). There is a significant difference in performance if the joins are properly
defined.
Create a mapping. When creating a mapping using an R/3 source definition, you must
use an ERP source qualifier. In the ERP source qualifier, you can customize properties
of the ABAP program that the R/3 server uses to extract source data. You can also use
joins, filters, ABAP program variables, ABAP code blocks, and SAP functions to
customize the ABAP program.
Generate and install ABAP program. You can install two types of ABAP programs for
each mapping:
File mode. Extract data to file. The PowerCenter Server accesses the file through
FTP or NFS mount.
Stream Mode. Extract data to buffers. The PowerCenter Server accesses the
buffers through CPI-C, the SAP protocol for program-to-program
communication.
You can modify the ABAP program block and customize according to your requirements
(e.g., if you want to get data incrementally, create a mapping variable/parameter and
use it in the ABAP program).
Create session and run workflow

Stream Mode. In stream mode, the installed ABAP program creates buffers on
the application server. The program extracts source data and loads it into the
buffers. When a buffer fills, the program streams the data to the PowerCenter
Server using CPI-C. With this method, the PowerCenter Server can process data
when it is received.
File Mode. When running a session in file mode, the session must be configured
to access the file through NFS mount or FTP. When the session runs, the
installed ABAP program creates a file on the application server. The program
extracts source data and loads it into the file. When the file is complete, the
PowerCenter Server accesses the file through FTP or NFS mount and continues
processing the session.
Data Integration Using RFC/BAPI Functions
PowerCenter Connect for SAP R/3 can generate RFC/BAPI function mappings in the
Designer to extract data from SAP R/3, change data in R/3, or load data into R/3. When
it uses an RFC/BAPI function mapping in a workflow, the PowerCenter Server makes
the RFC function calls on R/3 directly to process the R/3 data. It doesn't have to
generate and install the ABAP program for data extraction.
Data Integration Using ALE
PowerCenter Connect for SAP R/3 can integrate PowerCenter with SAP R/3 using ALE.
With PowerCenter Connect for SAP R/3, PowerCenter can generate mappings in the
Designer to receive outbound IDocs from SAP R/3 in real time. It can also generate
mappings to send inbound IDocs to SAP for data integration. When PowerCenter uses
an inbound or outbound mapping in a workflow to process data in SAP R/3 using ALE, it
doesn't have to generate and install the ABAP program for data extraction.
Analytical Business Components
Analytic Business Components for SAP R/3 (ABC) allows you to use predefined business
logic to extract and transform R/3 data. It works in conjunction with PowerCenter and
PowerCenter Connect for SAP R/3 to extract master data, perform lookups, and provide
documents and other fact and dimension data from the following R/3 modules:
Financial Accounting
Controlling
Materials Management
Personnel Administration and Payroll Accounting
Personnel Planning and Development
Sales and Distribution
Refer to the ABC Guide for complete installation and configuration information.


Data Profiling
Challenge
Data profiling is an option in PowerCenter version 7.0 and above that leverages existing
PowerCenter functionality and a data profiling GUI front-end to provide a wizard-driven
approach to creating data profiling mappings, sessions, and workflows. This Best
Practice is intended to provide an introduction for new users.
Description
Creating a Custom or Auto Profile
The data profiling option provides visibility into the data contained in source systems
and enables users to measure changes in the source data over time. This information
can help to improve the quality of the source data.
An auto profile is particularly valuable when you are data profiling a source for the first
time, since auto profiling offers a good overall perspective of a source. It provides a row
count, candidate key evaluation, and redundancy evaluation at the source level, and
domain inference, distinct value and null value count, and min, max, and average (if
numeric) at the column level. Creating and running an auto profile is quick and helps to
gain a reasonably thorough understanding of a source in a short amount of time.
A custom data profile is useful when there is a specific question about a source, such as
validating a business rule or verifying that data matches a particular pattern.
Setting Up the Profile Wizard
To customize the profile wizard for your preferences:
Open the Profile Manager and choose Tools > Options.
If you are profiling data using a database user that is not the owner of the tables
to be sourced, check the Use source owner name during profile mapping
generation option.
If you are in the analysis phase of your project, choose Always run profile
interactively since most of your data-profiling tasks will be interactive. (In later

phases of the project, uncheck this option since more permanent data profiles
are useful in these phases.)

Running and Monitoring Profiles
Profiles are run in one of two modes: interactive or batch. Choose the appropriate mode
by checking or unchecking Configure Session on the "Function-Level Operations" tab
of the wizard.
Use Interactive to create quick, single-use data profiles. The sessions will be
created with default configuration parameters.
For data-profiling tasks that will be reused on a regular basis, create the sessions
manually in Workflow Manager and configure and schedule them appropriately.

Generating And Viewing Profile Reports in PowerCenter/PowerAnalyzer
Use Profile Manager to view profile reports. Right-click on a profile and choose View
Report.
For greater flexibility, you can also use PowerAnalyzer to view reports. Each
PowerCenter client includes a PowerAnalyzer schema and reports xml file. The xml files
can be found in the \Extensions\DataProfile\IPAReports subdirectory of the client
installation.
You can create additional metrics, attributes, and reports in PowerAnalyzer to meet
specific business requirements. You can also schedule PowerAnalyzer reports and
alerts to send notifications in cases where data does not meet preset quality limits.
Sampling Techniques
Four types of sampling techniques are available with the PowerCenter data profiling
option:
No sampling - Uses all source data. Appropriate for relatively small data sources.
Automatic random sampling - PowerCenter determines the appropriate percentage to
sample, then samples random rows. Appropriate for larger data sources where you want
a statistically significant data analysis.
Manual random sampling - PowerCenter samples random rows of the source data based
on a user-specified percentage. Use this to sample more or fewer rows than the
automatic option chooses.
Sample first N rows - Samples the number of user-selected rows. Provides a quick
readout of a source (e.g., the first 200 rows).
Profile Warehouse Administration

Updating Data Profiling Repository Statistics
The Data Profiling repository contains nearly 30 tables with more than 80 indexes. To
ensure that queries run optimally, be sure to keep database statistics up to date. Run
the query below that is appropriate for your database type. Then capture the
script that is generated and run it (an example of the generated output appears after
the queries).
ORACLE
select 'analyze table ' || table_name || ' compute statistics;' from user_tables where
table_name like 'PMDP%';
select 'analyze index ' || index_name || ' compute statistics;' from user_indexes where
index_name like 'DP%';
Microsoft SQL Server
select 'update statistics ' + name from sysobjects where name like 'PMDP%'
SYBASE
select 'update statistics ' + name from sysobjects where name like 'PMDP%'
INFORMIX
select 'update statistics low for table ', tabname, ' ; ' from systables where tabname
like 'PMDP%'
IBM DB2
select 'runstats on table ' || rtrim(tabschema) || '. ' || tabname || ' and indexes all; '
from syscat.tables where tabname like 'PMDP%'
TERADATA
select 'collect statistics on ', tablename, ' index ', indexname from dbc.indices where
tablename like 'PMDP%' and databasename = 'database_name'
where database_name is the name of the repository database.
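For example, the Oracle query above emits one line per Data Profiling table; the table name below is only a placeholder for the names actually returned:

analyze table PMDP_<table_name> compute statistics;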
Purging Old Data Profiles
Use the Profile Manager to purge old profile data from the Profile Warehouse. Choose
Target Warehouse>Connect and connect to the profiling warehouse. Choose Target
Warehouse>Purge to open the purging tool.


Data Quality Mapping Rules
Challenge
Use PowerCenter to create data quality mapping rules to enhance the usability of the
data within your system.
Description
This Best Practice focuses on techniques for use with PowerCenter and third-party or
add-on software. Comments that are specific to the use of PowerCenter are enclosed in
brackets.
Basic Methodology
The issue of poor data quality is one that frequently hinders the success of data
integration projects. It can produce inconsistent or faulty results and ruin the credibility
of the system with the business users. Data quality problems often arise from a
breakdown in the overall process rather than from a specific issue that can be resolved by a
single software package.
Some of the principles applied to data quality improvements are borrowed from
manufacturing where they were initially designed to reduce the costs of manufacturing
processes. A number of methodologies evolved from these principles, all centered
around the same general process: Define, Discover, Analyze, Improve, and Combine.
Reporting is a crucial part of each process step, helping to guide the users through the
process. Together, these steps offer businesses an iterative approach to improving data
quality.
Define This is the first step of any data quality exercise, and also the first step
to data profiling. Users must first define the goals of the exercise. Some
questions that should arise may include: 1) what are the troublesome data types
and in what domains do they reside? 2) what data elements are of concern? 3)
where do those data elements exist? 4) how are correctness and consistency
measured? and 5) are metadata definitions complete and consistent? This step
is often supplemented by a metadata solution that allows knowledgeable users
to see specific data elements across the enterprise. It also addresses the
question of where the data should be fixed, and how to ensure that the data is
fixed at the correct place. This step also helps to define the rules that users
subsequently employ to create data profiles.

Discover Since data profiling is a collection of statistics about existing data, the
next step in the process is to use the information gathered in the first step to
evaluate the actual data. This process should quantify how correct the data is
with regard to the predefined rules. These rules can be stored so that the
process can be performed iteratively and provide feedback on whether the data
quality is improving. Refer to the Best Practice on Data Profiling for a complete
description of how to use PowerCenter's built-in data profiling capabilities.
Analyze The Analyze step takes the results of the Discover step and attempts to
identify the root causes of any data problems. Depending on the project, this
step may need to incorporate knowledge users from many different teams. This
step may also take more man hours than the other steps since much of the work
needs to be done by Business Analysts and/or Subject Matter Experts. The
issues should be prioritized so that the project team can address small chunks
of poor data quality at a time to ensure success. Since this process can be
repeated, there is no need to try to tackle the whole data quality problem in
one big bang.
Improve After the root causes of the data have been determined, steps should
be taken to scrub and clean the data. This step can be facilitated through the
use of specialized software packages that automate the clean-up process. This
includes the standardizing names, addresses, and formats. Data cleansing
software often uses other standard data sets provided by the software vendor to
match real addresses with people or companies, instead of relying on the
original values. Formats can also be cleaned up to remove any inconsistencies
introduced by multiple data entry clerks, such as those often found in product
IDs, telephone numbers, and other generic IDs or codes. Consistency rules can
be defined within the software package. It is sometimes advisable to profile the
data after the cleansing is complete to ensure that the software package has
effectively resolved the quality issues.
Combine Many enterprises have embarked on efforts to establish a single customer
or client identifier throughout the company. Since the data has been profiled and
cleansed at this point, the enterprise is now ready to start linking data elements
across source systems in order to reduce redundancy and increase consistency.
The rules that were defined in the Define step and leveraged in the Analyze step
play a big role here as well because master records are needed when removing
duplications.

Common Questions to Consider
Data integration/warehousing projects often encounter general data problems outside
the scope of a full-blown data quality project, but which also need to be addressed.
The remainder of this document discusses some methods to ensure a base level of data
quality; much of the content discusses specific strategies to use with PowerCenter.
The quality of data is important in all types of projects, whether it be data warehousing,
data synchronization, or data migration. Certain questions need to be considered for all
of these projects, with the answers driven by the project's requirements and the
business users being served. Ideally, these questions should be addressed
during the Design and Analyze phases of the project because they can require a
significant amount of re-coding if identified later.
Some of the areas to consider are:

Text formatting
The most common hurdle here is capitalization and trimming of spaces. Often, users
want to see data in its raw format without any capitalization, trimming, or formatting
applied to it. This is easily achievable as it is the default behavior, but there is danger in
taking this requirement literally since it can lead to duplicate records when some of
these fields are used to identify uniqueness and the system is combining data from
various source systems.
One solution to this issue is to create additional fields that act as a unique key to a
given table, but which are formatted in a standard way. Since the raw data is stored
in the table, users can still see it in this format, but the additional columns mitigate the
risk of duplication.
Another possibility is to explain to the users that raw data in unique, identifying fields
is not as clean and consistent as data in a common format. In other words, push back
on this requirement.
This issue can be particularly troublesome in data migration projects where matching
the source data is a high priority. Failing to trim leading/trailing spaces from data can
often lead to mismatched results since the spaces are stored as part of the data value.
The project team must understand how spaces are handled from the source systems to
determine the amount of coding required to correct this. (When using PowerCenter and
sourcing flat files, the options provided while configuring the File Properties may be
sufficient.) Remember that certain RDBMS products use the CHAR data type, which
stores the data with trailing blanks. These blanks need to be trimmed before matching
can occur. It is usually advisable to use CHAR only for one-character flag fields.
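As a minimal sketch of the standardized-key approach described above (all port names, such as in_CUSTOMER_NAME, are hypothetical), an Expression transformation output port can build the additional matching column while the raw value continues to flow to the display column unchanged:

   -- standardized key used only for uniqueness checks and matching
   UPPER(LTRIM(RTRIM(in_CUSTOMER_NAME)))

Because the raw port is still loaded to its own column, users see the original formatting while the standardized column mitigates the duplication risk.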


Note that many fixed-width files use spaces rather than a null character. In this case,
developers must enter a single space beside the Text radio button (as the null
character) and indicate that the character repeats to fill out the rest of the precision of
the column. The strip trailing blanks facility then strips any remaining spaces from the
end of the data value. In PowerCenter, avoid embedding database text-manipulation
functions in lookup transformations: doing so requires a SQL override, which in turn
forces the developer to cache the lookup table. On very large tables, caching is not
always realistic or feasible.
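As an alternative sketch (port names are hypothetical), the trimming can be done in an upstream Expression transformation so that the lookup condition receives an already-trimmed value and no SQL override, and therefore no forced caching, is required:

   -- trim trailing blanks from a CHAR-sourced code before it reaches the lookup condition
   RTRIM(in_PRODUCT_CODE)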
Datatype conversions
It is advisable to use explicit tool functions when converting the data type of a
particular data value.
[In PowerCenter, if the TO_CHAR function is not used, an implicit conversion is
performed, and 15 digits will be carried forward, even when they are not needed or
desired. PowerCenter can handle some conversions without function calls (these are
detailed in the product documentation), but this may cause subsequent support or
maintenance headaches.]
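A minimal sketch of explicit conversions (the port names and the scale of 2 are assumptions) that make the intended types visible in the mapping rather than relying on implicit conversion:

   -- numeric-to-string conversion made explicit
   TO_CHAR(in_ORDER_QTY)
   -- string-to-decimal conversion with an explicit scale
   TO_DECIMAL(in_UNIT_PRICE, 2)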
Dates
Dates can cause many problems when moving and transforming data from one place to
another because an assumption must be made that all data values are in a designated
format.
[Informatica recommends first checking a piece of data to ensure it is in the proper
format before trying to convert it to a Date data type. If the check is not performed
first, then a developer increases the risk of transformation errors, which can cause data
to be lost].
An example piece of code would be: IIF(IS_DATE(in_RECORD_CREATE_DT,
'YYYYMMDD'), TO_DATE(in_RECORD_CREATE_DT, 'YYYYMMDD'), NULL)
If the majority of the dates coming from a source system arrive in the same format,
then it is often wise to create a reusable expression that handles dates, so that the
proper checks are made. It is also advisable to determine if any default dates should be
defined, such as a low date or high date. These should then be used throughout the
system for consistency. However, do not fall into the trap of always using default dates
as some are meant to be NULL until the appropriate time (e.g., birth date or death
date).
The NULL in the example above could be changed to one of the standard default dates
described here.
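A hedged variant of the example above, returning an agreed default instead of NULL (the 9999-12-31 high date shown is an assumed project convention, not a product default):

   IIF(IS_DATE(in_RECORD_CREATE_DT, 'YYYYMMDD'),
       TO_DATE(in_RECORD_CREATE_DT, 'YYYYMMDD'),
       TO_DATE('99991231', 'YYYYMMDD'))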
Decimal precision
With numeric data columns, developers must determine the expected or required
precision of each column. [By default (to increase performance), PowerCenter treats all
numeric columns as 15-digit floating-point decimals, regardless of how they are defined
in the transformations. The maximum numeric precision in PowerCenter is 28 digits.]
If it is determined that a column realistically needs a higher precision, then the Enable
Decimal Arithmetic option in the Session Properties needs to be checked. However, be
aware that enabling this option can slow performance by as much as 15 percent. The
Enable Decimal Arithmetic option must also be enabled when comparing two numbers
for equality.
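Where full decimal arithmetic is not enabled, one hedged workaround (port names hypothetical, and only suitable when the required business precision is known) is to round both sides to that number of decimal places before comparing:

   -- treat the amounts as equal when they match to two decimal places
   IIF(ROUND(in_AMOUNT_SOURCE, 2) = ROUND(in_AMOUNT_TARGET, 2), 'MATCH', 'MISMATCH')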
Techniques for Trapping Poor Data Quality
The most important technique for ensuring good data quality is to prevent incorrect,
inconsistent, or incomplete data from ever reaching the target system. This goal may
be difficult to achieve in a data synchronization or data migration project, but it is very
relevant when discussing data warehousing or ODS. This section discusses techniques
that you can use to prevent bad data from reaching the system.
Checking data for completeness before loading
When requesting a data feed from an upstream system, be sure to request an audit file
or report that contains a summary of what to expect within the feed. Common requests
here are record counts or summaries of numeric data fields. Assuming that this can be
obtained from the source system, it is advisable to then create a pre-process step that
ensures your input source matches the audit file. If the values do not match, stop the
overall process from loading into your target system. The source system can then be
alerted to verify where the problem exists in its feed.
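One way to implement such a pre-process check within PowerCenter (a sketch only; the port names and the use of an upstream Aggregator to produce the row count are assumptions) is to compare the aggregated count against the value read from the audit file and abort the session on a mismatch:

   -- agg_ROW_COUNT comes from an Aggregator; in_AUDIT_COUNT comes from the audit file
   IIF(agg_ROW_COUNT != in_AUDIT_COUNT,
       ABORT('Source row count does not match the audit file'),
       agg_ROW_COUNT)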
Enforcing rules during mapping
Another method of filtering bad data is to have a set of clearly defined data rules built
into the load job. The records are then evaluated against these rules and routed to an
Error (or Bad) table for reprocessing as appropriate. An example of this is to check
all incoming Country Codes against a Valid Values table. If the code is not found, then
the record is flagged as an Error record and written to the Error table.
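A minimal sketch of the flagging logic (the lookup and port names are hypothetical): a Lookup on the Valid Values table returns the code when it exists, and an Expression sets a flag that a downstream Router uses to send the row to the Error table:

   -- 'E' marks an error row; 'V' marks a valid row
   IIF(ISNULL(lkp_COUNTRY_CODE), 'E', 'V')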
A pitfall of this method is that you must determine what happens to the record once it
has been loaded to the Error table. If the record is pushed back to the source system to
be fixed, then a delay may occur until the record can be successfully loaded to the
target system. In fact, if the proper governance is not in place, the source system may
refuse to fix the record at all. In this case, a decision must be made to either: 1) fix the
data manually and risk not matching with the source system; or 2) relax the business
rule to allow the record to be loaded.
Oftentimes, in the absence of an enterprise data steward, it is a good idea to assign a
team member the role of data steward. It is this person's responsibility to patrol these
tables and push back to the appropriate systems as necessary, as well as help to make
decisions about fixing or filtering bad data. A data steward should have a good
command of the metadata, and he/she should also understand the consequences to the
user community of data decisions.

Another solution applicable in cases with a small number of code values is to try and
anticipate any mistyped error codes and translate them back to the correct codes. The
cross-reference translation data can be accumulated over time. Each time an error is
corrected, both the incorrect and correct values should be put into the table and used
to correct future errors automatically.
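A hedged sketch of applying such a cross-reference (the translation lookup and port names are assumptions): look the incoming code up in the accumulated correction table and substitute the corrected value when one is on file:

   -- fall back to the original code when no correction exists
   IIF(NOT ISNULL(lkp_CORRECTED_CODE), lkp_CORRECTED_CODE, in_SOURCE_CODE)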
Dimension not found while loading fact
The majority of current data warehouses are built using a dimensional model. A
dimensional model relies on the presence of dimension records existing before loading
the fact tables. This can usually be accomplished by loading the dimension tables
before loading the fact tables. However, there are some cases where a corresponding
dimension record is not present at the time of the fact load. When this occurs,
consistent rules are needed to handle the situation so that data is not improperly
exposed to, or hidden from, the users.
One solution is to continue to load the data to the fact table, but assign the foreign key
a value that represents Not Found or Not Available in the dimension. These keys must
also exist in the dimension tables to satisfy referential integrity, but they provide a
clear and easy way to identify records that may need to be reprocessed at a later date.
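A minimal sketch of the Not Found key assignment (the -1 key value and port names are assumed conventions; the matching placeholder row must also exist in the dimension table):

   -- substitute the reserved Not Found surrogate key when the dimension lookup misses
   IIF(ISNULL(lkp_PRODUCT_KEY), -1, lkp_PRODUCT_KEY)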
Another solution is to filter the record from processing since it may no longer be
relevant to the fact table. The team will most likely want to flag the row through the
use of either error tables or process codes so that it can be reprocessed at a later time.
A third solution is to use dynamic caches and load the dimensions when a record is not
found there, even while loading the fact table. This should be done very carefully as it
may add unwanted or junk values to the dimension table. One occasion when this may
be advisable is in cases where a dimension is simply made up of the distinct
combinations of values in a data set; such a dimension may require a new record
whenever a new combination occurs.
It is imperative that all of these solutions be discussed with the users before making
any decisions, as they eventually will be the ones making decisions based on the
reports.


Deployment Groups
Challenge
Deployment groups offer a versatile and improved method of migrating work completed
in one repository to another repository. This Best Practice describes
ways deployment groups can be used to simplify migrations.
Description
Deployment Groups are containers that hold references to objects that need to be
migrated. This includes objects such as mappings, mapplets, reusable transformations,
sources, targets, workflows, sessions and tasks, as well as the object holders (i.e. the
repository folders). Deployment groups are faster and more flexible than folder moves
for incremental changes. In addition, they allow for migration rollbacks if necessary.
Migrating a deployment group allows you to copy objects in a single copy operation
from across multiple folders in the source repository into multiple folders in the target
repository. Copying a deployment group allows you to specify individual objects to
copy, rather than the entire contents of a folder.
There are two types of deployment groups: static and dynamic.
Static deployment groups contain direct references to versions of objects that
need to be moved. Users explicitly add the version of the object to be migrated
to the deployment group.
Dynamic deployment groups contain a query that is executed at the time of
deployment. The results of the query (i.e. object versions in the repository) are
then selected and copied to the target repository.
Dynamic deployment groups are generated from a query. While any available criteria
can be used, it is advisable to have developers use labels to simplify the query. See the
Best Practice on Using PowerCenter Labels, Strategies for Labels section, for further
information. When generating a query for deployment groups whose mappings and
mapplets contain non-reusable objects, there is a query condition that must be used in
addition to any specific selection criteria: the query must include a condition for Is
Reusable and use the qualifier one of Reusable and Non-Reusable. Without this, the
deployment may encounter errors if non-reusable objects are held within the mapping
or mapplet.

A deployment group exists in a specific repository. It can be used to move items to any
other accessible repository. A deployment group maintains a history of all migrations it
has performed. It tracks what versions of objects were moved from which folders in
which source repositories, and into which folders in which target repositories those
versions were copied (i.e. it provides a complete audit trail of all migrations
performed). Because the deployment group knows what it moved and where, an
administrator can, if necessary, have it undo the most recent deployment, reverting
the target repository to its pre-deployment state. Using labels
(as described in the Labels Best Practice) allows objects in the subsequent repository to
be tracked back to a specific deployment.
It is important to note that the deployment group only migrates the objects it contains
to the target repository. It does not, itself, move to the target repository. It still resides
in the source repository.
Deploying via the GUI
Migrations can be performed via the GUI or the command line (pmrep). To migrate
objects via the GUI, a user simply drags a deployment group from the repository it
resides in, onto the target repository where the objects it references are to be moved.
The Deployment Wizard appears, stepping the user through the deployment process.
The user can match folders in the source and target repositories so objects are moved
into the proper target folders, reset sequence generator values, etc. Once the wizard is
complete, the migration occurs, and the deployment history is created.
Deploying via the Command Line
The PowerCenter pmrep command can be used to automate both Folder Level
deployments (e.g. in a non-versioned repository) and deployments using Deployment
Groups. The commands DeployFolder and DeployDeploymentGroup in pmrep are
used respectively for these purposes. Whereas deployment via the GUI requires the
user to step through a wizard to answer the various questions to deploy, command-line
deployment requires the user to provide an XML control file, containing the same
information that is required by the wizard. This file must be present before the
deployment is executed.
Further Considerations for Deployment and Deployment Groups
Simultaneous Multi-Phase Projects
If there are multiple phases of a project being developed simultaneously in separate
folders, it is possible to consolidate them by mapping folders appropriately through the
deployment group migration wizard. When migrating with deployment groups in this
way, the override buttons in the migration wizard are used to select specific folder
mapping.
Rolling Back a Deployment
Deployment groups help to ensure that you have a back-out methodology. You can
rollback the latest version of a deployment. To do this:

In the target repository (where the objects were migrated to), go to
Versioning>>Deployment>>History>>View History>>Rollback.
The rollback purges all objects (of the latest version) that were in the deployment
group. You can initiate a rollback on a deployment as long as you roll back only the
latest versions of the objects. The rollback ensures that the check-in time for the
repository objects is the same as the deploy time.
Managing Repository Size
As you check in objects and deploy objects to target repositories, the number of object
versions in those repositories increases, and thus, the size of the repositories also
increases.
In order to manage repository size, use a combination of Check-in Date and Latest
Status (both are query parameters) to purge the desired versions from the repository
and retain only the very latest version. You could also choose to purge all the deleted
versions of the objects, which reduces the size of the repository.
If you want to keep more than the latest version, you can also include labels in your
query. These labels are ones that you have applied to the repository for the specific
purpose of identifying objects for purging.
Off-Shore On-Shore Migration
When migrating from an off-shore development environment to an on-shore
environment, other aspects of the computing environment may make it desirable to
generate a dynamic deployment group. Instead of migrating the group itself to the next
repository, you can use a query to select the objects for migration and save them to a
single XML file, which can then be transmitted to the on-shore environment through
alternative methods. If the on-shore repository is versioned, importing the file activates
the import wizard as if a deployment group were being received.
Migrating to a Non-Versioned Repository
In some instances, it may be desirable to migrate to a non-versioned repository from a
versioned repository. It should be noted that this changes the wizards used when
migrating in this manner, and that the export from the versioned repository has to take
place using XML export. Note that certain repository objects (e.g. connections) cannot
be automatically migrated, which may invalidate objects such as sessions. These
objects (i.e. connections) should be set up first in the receiving repository. The XML
import wizard will advise of any invalidations that occur.


Designing Analytic Data Architectures
Challenge
Develop a sound data architecture that can serve as a foundation for an analytic
solution that may evolve over many years.
Description
Historically, organizations have approached the development of a "data warehouse" or
"data mart" as a departmental effort, without considering an enterprise perspective.
The result has been silos of corporate data and analysis, which very often conflict with
each other in terms of both detailed data and the business conclusions implied by it.
Taking an enterprise-wide, architectural stance in developing analytic solutions provides
many advantages, including:
A sound architectural foundation ensures the solution can evolve and scale with
the business over time.
Proper architecture can isolate the application component (business context) of the
analytic solution from the technology.
Lastly, architectures allow for reuse: reuse of skills, design objects, and
knowledge.
As the evolution of analytic solutions (and the corresponding nomenclature) has
progressed, the necessity of building these solutions on a solid architectural framework
has become more and more clear. To understand why, a brief review of the history of
analytic solutions and their predecessors is warranted.
Historical Perspective
Online Transaction Processing Systems (OLTPs) have always provided a very detailed,
transaction-oriented view of an organization's data. While this view was indispensable
for the day-to-day operation of a business, its ability to provide a "big picture" view of
the operation, critical for management decision-making, was severely limited. Initial
attempts to address this problem took several directions:
Reporting directly against the production system. This approach minimized the
effort associated with developing management reports, but introduced a number of
significant issues:

The nature of OLTP data is, by definition, "point-in-time." Thus, reports run at different
times of the year, month, or even the day, were inconsistent with each other.
Ad hoc queries against the production database introduced uncontrolled performance
issues, resulting in slow reporting results and degradation of OLTP system performance.
Trending and aggregate analysis was difficult (or impossible) with the detailed data
available in the OLTP systems.
Mirroring the production system in a reporting database . While this
approach alleviated the performance degradation of the OLTP system, it did
nothing to address the other issues noted above.
Reporting databases . To address the fundamental issues associated with
reporting against the OLTP schema, organizations began to move toward
dedicated reporting databases. These databases were optimized for the types of
queries typically run by analysts, rather than those used by systems supporting
data entry clerks or customer service representatives. These databases may or
may not have included pre-aggregated data, and took several forms, including
traditional RDBMS as well as newer technology Online Analytical Processing
(OLAP) solutions.
The initial attempts at reporting solutions were typically point solutions; they were
developed internally to provide very targeted data to a particular department within the
enterprise. For example, the Marketing department might extract sales and
demographic data in order to infer customer purchasing habits. Concurrently, the Sales
department was also extracting sales data for the purpose of awarding commissions to
the sales force. Over time, these isolated silos of information became irreconcilable,
since the extracts and business rules applied to the data during the extract process
differed for the different departments.
The result of this evolution was that the Sales and Marketing departments might report
completely different sales figures to executive management, resulting in a lack of
confidence in both departments' "data marts." From a technical perspective, the
uncoordinated extracts of the same data from the source systems multiple times placed
undue strain on system resources.
The solution seemed to be the "centralized" or "galactic" data warehouse. This
warehouse would be supported by a single set of periodic extracts of all relevant data
into the data warehouse (or Operational Data Store), with the data being cleansed and
made consistent as part of the extract process. The problem with this solution was its
enormous complexity, typically resulting in project failure. The scale of these failures
led many organizations to abandon the concept of the enterprise data warehouse in
favor of the isolated, "stovepipe" data marts described earlier. While these solutions
still had all of the issues discussed previously, they had the clear advantage of
providing individual departments with the data they needed without the
unmanageability of the enterprise solution.
As individual departments pursued their own data and analytical needs, they not only
created data stovepipes, they also created technical islands. The approaches to
populating the data marts and performing the analytical tasks varied widely, resulting
in a single enterprise evaluating, purchasing, and being trained on multiple tools and
adopting multiple methods for performing these tasks. If, at any point, the organization
did attempt to undertake an enterprise effort, it was likely to face the daunting
challenge of integrating the disparate data as well as the widely varying technologies.
To deal with these issues, organizations began developing approaches that considered
the enterprise-level requirements of an analytical solution.
Centralized Data Warehouse
The first approach to gain popularity was the centralized data warehouse. Designed to
solve the decision support needs for the entire enterprise at one time, with one effort,
the data integration process extracts the data directly from the operational systems. It
transforms the data according to the business rules and loads it into a single target
database serving as the enterprise-wide data warehouse.


Advantages
The centralized model offers a number of benefits to the overall architecture, including:
Centralized control . Since a single project drives the entire process, there is
centralized control over everything occurring in the data warehouse. This makes
it easier to manage a production system while concurrently integrating new
components of the warehouse.
Consistent metadata . Because the warehouse environment is contained in a
single database and the metadata is stored in a single repository, the entire
enterprise can be queried whether you are looking at data from Finance,
Customers, or Human Resources.
Enterprise view . Developing the entire project at one time provides a global
view of how data from one workgroup coordinates with data from others. Since
the warehouse is highly integrated, different workgroups often share common
tables such as customer, employee, and item lists.

High data integrity . A single, integrated data repository for the entire enterprise
would naturally avoid all data integrity issues that result from duplicate copies
and versions of the same business data.
Disadvantages
Of course, the centralized data warehouse also involves a number of drawbacks,
including:
Lengthy implementation cycle. With the complete warehouse environment
developed simultaneously, many components of the warehouse become
daunting tasks, such as analyzing all of the source systems and developing the
target data model. Even minor tasks, such as defining how to measure profit
and establishing naming conventions, snowball into major issues.
Substantial up-front costs . Many analysts who have studied the costs of this
approach agree that this type of effort nearly always runs into the millions.
While this level of investment is often justified, the problem lies in the delay
between the investment and the delivery of value back to the business.
Scope too broad . The centralized data warehouse requires a single database to
satisfy the needs of the entire organization. Attempts to develop an enterprise-
wide warehouse using this approach have rarely succeeded, since the goal is
simply too ambitious. As a result, this wide scope has been a strong contributor
to project failure.
Impact on the operational systems . Different tables within the warehouse
often read data from the same source tables, but manipulate it differently before
loading it into the targets. Since the centralized approach extracts data directly
from the operational systems, a source table that feeds into three different
target tables is queried three times to load the appropriate target tables in the
warehouse. When combined with all the other loads for the warehouse, this can
create an unacceptable performance hit on the operational systems.

Independent Data Mart
The second warehousing approach is the independent data mart, which gained
popularity in 1996 when DBMS magazine ran a cover story featuring this strategy. This
architecture is based on the same principles as the centralized approach, but it scales
down the scope from solving the warehousing needs of the entire company to the
needs of a single department or workgroup.
Much like the centralized data warehouse, an independent data mart extracts data
directly from the operational sources, manipulates the data according to the business
rules, and loads a single target database serving as the independent data mart. In
some cases, the operational data may be staged in an Operational Data Store (ODS)
and then moved to the mart.



Advantages
The independent data mart is the logical opposite of the centralized data warehouse.
The disadvantages of the centralized approach are the strengths of the independent
data mart:
Impact on operational databases localized . Because the independent data
mart is trying to solve the DSS needs of a single department or workgroup, only
the few operational databases containing the information required need to be
analyzed.
Reduced scope of the data model . The target data modeling effort is vastly
reduced since it only needs to serve a single department or workgroup, rather
than the entire company.
Lower up-front costs . The data mart is serving only a single department or
workgroup; thus hardware and software costs are reduced.
Fast implementation . The project can be completed in months, not years. The
process of defining business terms and naming conventions is simplified since
"players from the same team" are working on the project.
Disadvantages
Of course, independent data marts also have some significant disadvantages:
Lack of centralized control . Because several independent data marts are
needed to solve the decision support needs of an organization, there is no
centralized control. Each data mart or project controls itself, but there is no
central control from a single location.
Redundant data . After several data marts are in production throughout the
organization, all of the problems associated with data redundancy surface, such
as inconsistent definitions of the same data object or timing differences that
make reconciliation impossible.
Metadata integration . Due to their independence, the opportunity to share
metadata - for example, the definition and business rules associated with the
Invoice data object - is lost. Subsequent projects must repeat the development
and deployment of common data objects.
Manageability . The independent data marts control their own scheduling
routines and therefore store and report their metadata differently, with a
negative impact on the manageability of the data warehouse. There is no
centralized scheduler to coordinate the individual loads appropriately or
metadata browser to maintain the global metadata and share development work
among related projects.
Dependent Data Marts (Federated Data Warehouses)
The third warehouse architecture is the dependent data mart approach supported by
the hub-and-spoke architecture of PowerCenter and PowerMart. After studying more
than one hundred different warehousing projects, Informatica introduced this approach
in 1998, leveraging the benefits of the centralized data warehouse and independent
data mart.
The more general term being adopted to describe this approach is the "federated data
warehouse." Industry analysts have recognized that, in many cases, there is no "one
size fits all" solution. Although the goal of true enterprise architecture, with conformed
dimensions and strict standards, is laudable, it is often impractical, particularly for early
efforts. Thus, the concept of the federated data warehouse was born. It allows for the
relatively independent development of data marts, but leverages a centralized
PowerCenter repository for sharing transformations, source and target objects, business
rules, etc.
Recent literature describes the federated architecture approach as a way to get closer
to the goal of a truly centralized architecture while allowing for the practical realities of
most organizations. The centralized warehouse concept is sacrificed in favor of a more
pragmatic approach, whereby the organization can develop semi-autonomous data
marts, so long as they subscribe to a common view of the business. This common
business model is the fundamental, underlying basis of the federated architecture, since
it ensures consistent use of business terms and meanings throughout the enterprise.
With the exception of the rare case of a truly independent data mart, where no future
growth is planned or anticipated, and where no opportunities for integration with other
business areas exist, the federated data warehouse architecture provides the best
framework for building an analytic solution.
Informatica's PowerCenter and PowerMart products provide an essential capability for
supporting the federated architecture: the shared Global Repository. When used in
conjunction with one or more Local Repositories, the Global Repository serves as a sort
of "federal" governing body, providing a common understanding of core business
concepts that can be shared across the semi-autonomous data marts. These data marts
each have their own Local Repository, which typically include a combination of purely
local metadata and shared metadata by way of links to the Global Repository.


This environment allows for relatively independent development of individual data
marts, but also supports metadata sharing without obstacles. The common business
model and names described above can be captured in metadata terms and stored in the
Global Repository. The data marts use the common business model as a basis, but
extend the model by developing departmental metadata and storing it locally.
A typical characteristic of the federated architecture is the existence of an Operational
Data Store (ODS). Although this component is optional, it can be found in many
implementations that extract data from multiple source systems and load multiple
targets. The ODS was originally designed to extract and hold operational data that
would be sent to a centralized data warehouse, working as a time-variant database to
support end-user reporting directly from operational systems. A typical ODS had to be
organized by data subject area because it did not retain the data model from the
operational system.
Informatica's approach to the ODS, by contrast, has virtually no change in data model
from the operational system, so it need not be organized by subject area. The ODS
does not permit direct end-user reporting, and its refresh policies are more closely
aligned with the refresh schedules of the enterprise data marts it may be feeding. It
can also perform more sophisticated consolidation functions than a traditional ODS.
Advantages
The Federated architecture brings together the best features of the centralized data
warehouse and independent data mart:
Room for expansion . While the architecture is designed to quickly deploy the
initial data mart, it is also easy to share project deliverables across subsequent
data marts by migrating local metadata to the Global Repository. Reuse is built
in.

Centralized control . A single platform controls the environment from
development to test to production. Mechanisms to control and monitor the data
movement from operational databases into the analytic environment are applied
across the data marts, easing the system management task.
Consistent metadata . A Global Repository spans all the data marts, providing a
consistent view of metadata.
Enterprise view . Viewing all the metadata from a central location also provides
an enterprise view, easing the maintenance burden for the warehouse
administrators. Business users can also access the entire environment when
necessary (assuming that security privileges are granted).
High data integrity . Using a set of integrated metadata repositories for the
entire enterprise removes data integrity issues that result from duplicate copies
of data.
Minimized impact on operational systems . Frequently accessed source data,
such as customer, product, or invoice records is moved into the decision support
environment once, leaving the operational systems unaffected by the number of
target data marts.
Disadvantages
Disadvantages of the federated approach include:
Data propagation. This approach moves data twice: first to the ODS, then into the
individual data mart. This requires extra database space to store the staged data
as well as extra time to move the data. However, the disadvantage can be
mitigated by not saving the data permanently in the ODS. After the warehouse
is refreshed, the ODS can be truncated, or a rolling three months of data can be
saved.
Increased development effort during initial installations . For each table in
the target, there needs to be one load developed from the ODS to the target, in
addition to all the loads from the source to the targets.
Operational Data Store
Using a staging area or ODS differs from a centralized data warehouse approach since
the ODS is not organized by subject area and is not customized for viewing by end
users or even for reporting. The primary focus of the ODS is in providing a clean,
consistent set of operational data for creating and refreshing data marts. Separating
out this function allows the ODS to provide more reliable and flexible support.
Data from the various operational sources is staged for subsequent extraction by target
systems in the ODS. In the ODS, data is cleaned and remains normalized, tables from
different databases are joined, and a refresh policy is carried out (a change/capture
facility may be used to schedule ODS refreshes, for instance).
The ODS and the data marts may reside in a single database or be distributed across
several physical databases and servers.
Characteristics of the Operational Data Store are:
Normalized

Detailed (not summarized)
Integrated
Cleansed
Consistent
Within an enterprise data mart, the ODS can consolidate data from disparate systems
in a number of ways:
Normalizes data where necessary (such as non-relational mainframe data),
preparing it for storage in a relational system.
Cleans data by enforcing commonalties in dates, names and other data types that
appear across multiple systems.
Maintains reference data to help standardize other formats; references might
range from zip codes and currency conversion rates to product-code-to-product-
name translations. The ODS may apply fundamental transformations to some
database tables in order to reconcile common definitions, but the ODS is not
intended to be a transformation processor for end-user reporting requirements.
Its role is to consolidate detailed data within common formats. This enables users to
create wide varieties of analytical reports, with confidence that those reports will be
based on the same detailed data, using common definitions and formats.
The following table compares the key differences in the three architectures:
Architecture            Centralized Data   Independent   Federated Data
                        Warehouse          Data Mart     Warehouse
Centralized Control     Yes                No            Yes
Consistent Metadata     Yes                No            Yes
Cost Effective          No                 Yes           Yes
Enterprise View         Yes                No            Yes
Fast Implementation     No                 Yes           Yes
High Data Integrity     Yes                No            Yes
Immediate ROI           No                 Yes           Yes
Repeatable Process      No                 Yes           Yes
The Role of Enterprise Architecture
The federated architecture approach allows for the planning and implementation of an
enterprise architecture framework that addresses not only short-term departmental
needs, but also the long-term enterprise requirements of the business. This does not
mean that the entire architectural investment must be made in advance of any
application development. However, it does mean that development is approached
within the guidelines of the framework, allowing for future growth without significant
technological change. The remainder of this chapter will focus on the process of
designing and developing an analytic solution architecture using PowerCenter as the
platform.
Fitting Into the Corporate Architecture
Very few organizations have the luxury of creating a "green field" architecture to
support their decision support needs. Rather, the architecture must fit within an
existing set of corporate guidelines regarding preferred hardware, operating systems,
databases, and other software. The Technical Architect, if not already an employee of
the organization, should ensure that he/she has a thorough understanding of the
existing (and future vision of) technical infrastructure. Doing so will eliminate the
possibility of developing an elegant technical solution that will never be implemented
because it defies corporate standards.


Developing an Integration Competency Center
Challenge
With increased pressure on IT productivity, many companies are rethinking the
independence of data integration projects, which has resulted in an inefficient,
piecemeal, or silo-based approach to each new project. Furthermore, as each group
within a business attempts to integrate its data, it unknowingly duplicates effort the
company has already invested: not just in the data integration itself, but also in the
effort spent on developing practices, processes, code, and personnel expertise.
An alternative to this expensive redundancy is to create some type of integration
competency center (ICC). An ICC is an IT approach that provides teams throughout an
organization with best practices in integration skills, processes, and technology so that
they can complete data integration projects consistently, rapidly, and cost-efficiently.
What types of services should your ICC offer? This Best Practice provides an overview
of offerings to help you consider the appropriate structure for your ICC.
Description

Objectives
Typical ICC objectives include:
Promoting data integration as a formal discipline
Developing a set of experts with data integration skills and processes, and
leveraging their knowledge across the organization
Building and developing skills, capabilities, and best practices for integration
processes and operations
Monitoring, assessing, and selecting integration technology and tools
Managing integration pilots
Leading and supporting integration projects with the cooperation of subject matter
experts
Reusing development work such as source definitions, application interfaces, and
codified business rules

Benefits

Although a successful project that shares its lessons learned with other teams can be a
great way to begin developing organizational awareness of the value of an ICC, setting
up a more formal ICC requires upper-management buy-in and funding. Here are
some of the typical benefits that can be realized from doing so:
Rapid development of in-house expertise through coordinated training and shared
knowledge
Leverage of shared resources and best practice methods and solutions
More rapid project deployments
Higher quality / reduced risk data integration projects
Reduced costs of project development and maintenance
When examining the move toward an ICC model that optimizes and, in certain
situations, centralizes integration functions, consider two things: the problems, costs,
and risks associated with a project silo-based approach, and the potential benefits of an ICC
environment.
What Services should be in Your ICC?
The common services provided by ICCs can be divided into four major categories:
Knowledge Management
Environment
Development Support
Production Support

Detailed Services Listings by Category
Knowledge Management

Training
o Standards Training (Training Coordinator)

Training on best practices, including, but not limited to, naming
conventions, unit test plans, configuration management strategy, and
project methodology.
o Product Training (Training Coordinator)

Co-ordination of vendor-offered or internally-sponsored training of
specific technology products.
Standards
o Standards Development (Knowledge Coordinator)

Creating best practices, including but not limited to, naming conventions,
unit test plans, and coding standards.
o Standards Enforcement (Knowledge Coordinator)

Ensuring that development teams use documented best practices through
formal development reviews, metadata reports, project audits, or other
means.
o Methodology (Knowledge Coordinator)

Creating methodologies to support development initiatives. Examples
include methodologies for rolling out data warehouses and data
integration projects. Typical topics in a methodology include, but are not
limited to:
Project Management
Project Estimation
Development Standards
Operational Support
o Mapping Patterns (Knowledge Coordinator)

Developing and maintaining mapping patterns (templates) to speed up
development time and promote mapping standards across projects.
Technology
o Emerging Technologies (Technology Leader )

Assessing emerging technologies, determining if and where they fit in the
organization, and defining policies around their adoption and use.
o Benchmarking (Technology Leader)

Conducting and documenting tests on hardware and software in the
organization to establish performance benchmarks
Metadata
o Metadata Standards (Metadata Administrator)

Creating standards for capturing and maintaining metadata. (Example:
database column descriptions will be captured in ErWin and pushed to
PowerCenter via Metadata Exchange)
o Metadata Enforcement (Metadata Administrator)

Enforcing development teams to conform to documented metadata
standards
o Data Integration Catalog (Metadata Administrator)

Tracking the list of systems involved in data integration efforts, the
integration between systems, and the use of/subscription to data
integration feeds. This information is critical to managing the
interconnections in the environment in order to avoid duplication of
integration efforts. The Catalog will also assist in understanding when
particular integration feeds are no longer needed.
Environment

Hardware

o Vendor Selection and Management (Vendor Manager)

Selecting vendors for the hardware needed for integration efforts, which
may span servers, storage, and network facilities.
o Hardware Procurement (Vendor Manager)

Responsible for the purchasing process for hardware items that may
include receiving and cataloging the physical hardware items.
o Hardware Architecture (Technical Architect)

Developing and maintaining the physical layout and details of the
hardware used to support the Integration Competency Center
o Hardware Installation (Product Specialist)

Setting up and activating new hardware as it becomes part of the
physical architecture supporting the Integration Competency Center
o Hardware Upgrades (Product Specialist)

Managing hardware upgrades, including operating system patches,
additional CPU/memory, and replacement of old technology.
Software
o Vendor Selection and Management (Vendor Manager)

Selecting vendors for the software tools needed for integration efforts.
Activities may include formal RFPs, vendor presentation reviews,
software selection criteria, maintenance renewal negotiations and all
activities related to managing the software vendor relationship.
o Software Procurement (Vendor Manager)

Responsible for the purchasing process for software packages and
licenses
o Software Architecture (Technical Architect)

Developing and maintaining the architecture of the software package(s)
used in the competency center. This may include flowcharts and decision
trees of what software to select for specific tasks.
o Software Installation (Product Specialist)

Setting up and installing new software as it becomes part of the physical
architecture supporting the Integration Competency Center
o Software Upgrade (Product Specialist)

Managing the upgrade of software including patches and new releases.
Depending on the nature of the upgrade, significant planning and rollout
effort may be required (training, testing, physical installation on client
machines, etc.).
o Compliance (Licensing) (Vendor Manager)

Monitoring and ensuring proper licensing compliance across development
teams. Formal audits or reviews may be scheduled. Physical
documentation should be kept matching installed software with
purchased licenses.
Professional Services

o Vendor Selection and Management (Vendor Manager)

Selecting vendors for professional services efforts related to integration
efforts. Activities may include managing vendor rates and bulk discount
negotiations, payment of vendors, reviewing past vendor work efforts,
managing list of preferred vendors etc.
o Vendor Qualification (Vendor Manager)

Conducting formal vendor interviews as consultants/contractors are
proposed for projects, checking vendor references and certifications, and
formally qualifying selected vendors for specific work tasks (e.g., Vendor
A is qualified for Java development while Vendor B is qualified for ETL
and EAI work).
Security
o Security Administration (Security Administrator)

Providing access to the tools and technology needed to complete data
integration development efforts including software user ids, source
system user id/passwords, and overall data security of the integration
efforts. Ensures enterprise security processes are followed.
o Disaster Recovery (Technical Architect)

Performing risk analysis in order to develop and execute a plan for
disaster recovery including repository backups, off-site backups, failover
hardware, notification procedures and other tasks related to a
catastrophic failure (e.g., a server room fire destroys the development and production servers).
Financial
o Budget (ICC Manager)

Yearly budget management for the Integration Competency Center.
Responsible for managing outlays for services, support, hardware,
software and other costs.
o Departmental Cost Allocation (ICC Manager)

For clients where shared-services costs are to be spread across
departments or business units. Activities include defining the metrics
used for cost allocation, reporting on the metrics, and applying cost
factors for billing on a weekly, monthly, or quarterly basis as required.
Scalability/Availability
o High Availability (Technical Architect)

Designing and implementing hardware, software and procedures to
ensure high availability of the data integration environment.
o Capacity Planning (Technical Architect)

Designing and planning for additional integration capacity to address the
organization's future growth in the size and volume of data integration.
Development Support


Performance
o Performance and Tuning (Product Specialist)

Providing targeted performance and tuning assistance for integration
efforts. Providing on-going assessments of load windows and schedules
to ensure service level agreements are being met.
Shared Objects
o Shared Object Quality Assurance (Quality Assurance)

Providing quality assurance services for shared objects so that objects
conform to standards and do not adversely affect the various projects
that may be using them.
o Shared Object Change Management (Change Control Coordinator)

Managing the migration to production of shared objects which may
impact multiple project teams. Activities include defining the schedule
for production moves, notifying teams of changes, and coordinating the
migration of the object to production.
o Shared Object Acceptance (Change Control Coordinator)

Defining and documenting the criteria for a shared object and officially
certifying an object as one that will be shared across project teams.
o Shared Object Documentation (Change Control Coordinator)

Defining the standards for documentation of shared objects and
maintaining a catalog of all shared objects and their functions.
Project Support
o Development Helpdesk (Data Integration Developer)

Providing a helpdesk of expert product personnel to support project
teams. This will provide project teams new to developing data
integration routines with a place to turn to for experienced guidance.
o Software/Method Selection (Technical Architect)

Providing a workflow or decision tree to use when deciding which data
integration technology to use for a given technology request.
o Requirements Definition (Business/Technical Analyst)

Developing the process to gather and document integration
requirements. Depending on the level of service, activities may include
assisting with, or even fully gathering, the requirements for the project.
o Project Estimation (Project Manager)

Developing project estimation models and providing estimation assistance
for data integration efforts.
o Project Management (Project Manager)

Providing full time management resources experienced in data
integration to ensure successful projects.
o Project Architecture Review (Data Integration Architect)

Providing project-level architecture review as part of the design process
for data integration projects. Helping ensure standards are met and the
project architecture fits within the enterprise architecture vision.
o Detailed Design Review (Data Integration Developer)

Reviewing design specifications in detail to ensure conformance to
standards and identifying any issues upfront before development work is
begun.
o Development Resources (Data Integration Developer)

Providing product-skilled resources for completion of the development
efforts.
o Data Profiling (Data Integration Developer)

Providing data profiling services to identify data quality issues and
developing plans for addressing the issues found.
o Data Quality (Data Integration Developer)

Defining and meeting data quality levels and thresholds for data
integration efforts.
Testing
o Unit Testing (Quality Assurance )

Defining and executing unit testing of data integration processes.
Deliverables include documented test plans, test cases and verification
against end user acceptance criteria.
o System Testing (Quality Assurance)

Defining and performing system testing to ensure that data integration
efforts work seamlessly across multiple projects and teams.
Cross Project Integration
o Schedule Management/Planning (Data Integration Developer)

Providing a single point for managing load schedules across the physical
architecture to make best use of available resources and appropriately
handle integration dependencies.
o Impact Analysis (Data Integration Developer)

Providing impact analysis on proposed and scheduled changes that may
impact the integration environment. Changes include but are not limited
to system enhancements, new systems, retirement of old systems, data
volume changes, shared object changes, hardware migration and system
outages.
Production Support

Issue Resolution
o Operations Helpdesk (Production Operator)

First line of support for operations issues, providing high-level issue
resolution. The helpdesk fields support cases and issues related to
scheduled jobs, system availability and other production support tasks.
o Data Validation (Quality Assurance)

Providing data validation on integration load tasks. Data may be held
from end user access until some level of data validation has been
performed. This may range from a manual review of load statistics to an
automated review of record counts, including grand-total comparisons,
expected size thresholds, or any other metric an organization may define
to catch potential data inconsistencies before they reach end users.
Production Monitoring
o Schedule Monitoring (Production Operator)

Nightly/daily monitoring of the data integration load jobs: ensuring jobs
are properly initiated, are not delayed, and complete successfully. May
provide first-level support for the load schedule while escalating issues
to the appropriate support teams.
o Operations Metadata Delivery (Production Operator)

Responsible for providing metadata to system owners and end users
regarding the production load process including load times, completion
status, known issues and other pertinent information regarding the
current state of the integration job stream.
Change Management
o Object Migration (Change Control Coordinator)

Coordinating movement of development objects and processes to
production. May even physically control migration such that all migration
is scheduled, managed, and performed by the ICC.
o Change Control Review (Change Control Coordinator)

Conducting formal and informal reviews of production changes before
migration is approved. At this time, standards may be enforced, system
tuning reviewed, production schedules updated, and formal sign-off on
production changes issued.
o Process Definition (Change Control Coordinator)

Developing and documenting the change management process such that
development objects are efficiently and flawlessly migrated into the
production environment. This may include notification rules, scheduled
migration plans, emergency-fix procedures, etc.


Development FAQs
Challenge
Using the PowerCenter product suite to effectively develop, name, and document
components of the analytic solution. While the most effective use of PowerCenter
depends on the specific situation, this Best Practice addresses some questions that are
commonly raised by project teams. It provides answers in a number of areas, including
Scheduling, Backup Strategies, Server Administration, and Metadata. Refer to the
product guides supplied with PowerCenter for additional information.
Description
The following pages summarize some of the questions that typically arise during
development and suggest potential resolutions.
Q: How does source format affect performance? (i.e., is it more efficient to source from
a flat file rather than a database?)
In general, a flat file that is located on the server machine loads faster than a database
located on the server machine. Fixed-width files are faster than delimited files because
delimited files require extra parsing. However, if there is an intent to perform intricate
transformations before loading to target, it may be advisable to first load the flat file
into a relational database, which allows the PowerCenter mappings to access the data
in an optimized fashion by using filters and custom SQL SELECTs where appropriate.
Q: What are some considerations when designing the mapping? (i.e. what is the impact
of having multiple targets populated by a single map?)
With PowerCenter, it is possible to design a mapping with multiple targets. You can
then load the targets in a specific order using Target Load Ordering. The
recommendation is to limit the amount of complex logic in a mapping. Not only is it
easier to debug a mapping with a limited number of objects, but such mappings can
also be run concurrently and make use of more system resources. When using multiple
output files (targets), consider writing to multiple disks or file systems simultaneously.
This minimizes disk seeks and applies to a session writing to multiple targets, and to
multiple sessions running simultaneously.
Q: What are some considerations for determining how many objects and
transformations to include in a single mapping?

There are several items to consider when building a mapping. The business
requirement is always the first consideration, regardless of the number of objects it
takes to fulfil the requirement. The most expensive use of the DTM is passing
unnecessary data through the mapping. It is best to use filters as early as possible in
the mapping to remove rows of data that are not needed. This is the SQL equivalent of
the WHERE clause. Using the filter condition in the Source Qualifier to filter out the
rows at the database level is a good way to increase the performance of the mapping.
Log File Organization
Q: Where is the best place to maintain Session Logs?
One often-recommended location is the default "SessLogs" folder in the PowerCenter
directory, keeping all log files in the same directory.
Q: What documentation is available for the error codes that appear within the error log
files?
Log file errors and descriptions appear in Appendix C of the PowerCenter
Troubleshooting Guide. Error information also appears in the PowerCenter Help File within
the PowerCenter client applications. For other database-specific errors, consult your
Database User Guide.
Scheduling Techniques
Q: What are the benefits of using workflows with multiple tasks rather than a workflow
with a stand-alone session?
Using a workflow to group logical sessions minimizes the number of objects that must
be managed to successfully load the warehouse. For example, a hundred individual
sessions can be logically grouped into twenty workflows. The Operations group can then
work with twenty workflows to load the warehouse, which simplifies the operations
tasks associated with loading the targets.
Workflows can be created to run sequentially or concurrently, or have tasks in different
paths doing either.
A sequential workflow runs sessions and tasks one at a time, in a linear sequence.
Sequential workflows help ensure that dependencies are met as needed. For
example, a sequential workflow ensures that session1 runs before session2
when session2 is dependent on the load of session1, and so on. It's also possible
to set up conditions to run the next session only if the previous session was
successful, or to stop on errors, etc.
A concurrent workflow groups logical sessions and tasks together, like a sequential
workflow, but runs all the tasks at one time. This can reduce the load times into
the warehouse, taking advantage of hardware platforms' Symmetric Multi-
Processing (SMP) architecture.
Other workflow options, such as nesting worklets within workflows, can further reduce
the complexity of loading the warehouse. This capability allows for the creation of very
complex and flexible workflow streams without the use of a third-party scheduler.
Q: Assuming a workflow failure, does PowerCenter allow restart from the point of
failure?
No. When a workflow fails, you can choose to start a workflow from a particular task
but not from the point of failure. It is possible, however, to create tasks and flows
based on error handling assumptions.
Q: What guidelines exist regarding the execution of multiple concurrent sessions /
workflows within or across applications?
Workflow Execution needs to be planned around two main constraints:
Available system resources
Memory and processors
The number of sessions that can run at one time depends on the number of processors
available on the server. The load manager is always running as a process. As a general
rule, a session will be compute-bound, meaning its throughput is limited by the
availability of CPU cycles. Most sessions are transformation intensive, so the DTM
always runs. Also, some sessions require more I/O, so they use less processor time.
Generally, a session needs about 120 percent of a processor for the DTM, reader, and
writer in total.
For concurrent sessions:
One session per processor is about right; you can run more, but that requires a "trial
and error" approach to determine what number of sessions starts to affect session
performance and possibly adversely affect other executing tasks on the server.
The sessions should run at "off-peak" hours to have as many available resources as
possible.
Even after available processors are determined, it is necessary to look at overall system
resource usage. Determining memory usage is more difficult than the processors
calculation; it tends to vary according to system load and number of PowerCenter
sessions running.
The first step is to estimate memory usage, accounting for:
Operating system kernel and miscellaneous processes
Database engine
Informatica Load Manager
The DTM process creates threads to initialize the session, read, write and transform
data, and handle pre- and post-session operations.
More memory is allocated for lookups, aggregates, ranks, sorters and
heterogeneous joins in addition to the shared memory segment.

At this point, you should have a good idea of what is left for concurrent sessions. It is
important to arrange the production run to maximize use of this memory. Remember to
account for sessions with large memory requirements; you may be able to run only one
large session, or several small sessions concurrently.
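The arithmetic above can be captured in a short planning sketch. The following Python example is illustrative only: the 120-percent-of-a-processor rule and the idea of subtracting kernel, database, and Load Manager overhead come from the guidelines above, but every number used (total RAM, overheads, per-session memory) is an assumption to be replaced with measured values from your own environment.

    # Illustrative capacity-planning sketch; all figures are assumptions, not measurements.
    def max_concurrent_sessions(cpus, total_ram_mb,
                                os_overhead_mb, db_engine_mb, load_manager_mb,
                                avg_session_mb):
        """Estimate how many sessions can run concurrently.

        Applies the rough rules of thumb from this Best Practice:
        - a session needs roughly 120 percent of one processor in total
        - whatever memory is left after the OS kernel, database engine, and
          Load Manager is what concurrent sessions (DTM, caches) can share.
        """
        cpu_bound = int(cpus / 1.2)                      # ~1 session per processor
        free_mb = total_ram_mb - os_overhead_mb - db_engine_mb - load_manager_mb
        memory_bound = int(free_mb / avg_session_mb)     # sessions the leftover RAM supports
        return max(0, min(cpu_bound, memory_bound))

    # Example with assumed values: 8 CPUs, 16 GB RAM, and sessions averaging
    # 300 MB each for shared memory plus lookup/aggregator caches.
    print(max_concurrent_sessions(cpus=8, total_ram_mb=16384,
                                  os_overhead_mb=1024, db_engine_mb=4096,
                                  load_manager_mb=256, avg_session_mb=300))

Remember that a single large session (for example, one with very large lookup or aggregator caches) can consume most of this headroom on its own, so the average-per-session figure should be revisited whenever the session mix changes.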
Load Order Dependencies are also an important consideration because they often
create additional constraints. For example, load the dimensions first, then facts. Also,
some sources may only be available at specific times, some network links may become
saturated if overloaded, and some target tables may need to be available to end users
earlier than others.
Q: Is it possible to perform two "levels" of event notification? At the application level
and the PowerCenter Server level to notify the Server Administrator?
The application level of event notification can be accomplished through post-session
email. Post-session email allows you to create two different messages; one to be sent
upon successful completion of the session, the other to be sent if the session fails.
Messages can be a simple notification of session completion or failure, or a more
complex notification containing specifics about the session. You can use the following
variables in the text of your post-session email:

Email Variable Description
%s Session name
%l Total records loaded
%r Total records rejected
%e Session status
%t Table details, including read throughput in bytes/second and write
throughput in rows/second
%b Session start time
%c Session completion time
%i Session elapsed time (session completion time-session start time)
%g Attaches the session log to the message
%m Name and version of the mapping used in the session
%d Name of the folder containing the session
%n Name of the repository containing the session
%a<filename> Attaches the named file. The file must be local to the Informatica
Server. The following are valid filenames: %a<c:\data\sales.txt> or
%a</users/john/data/sales.txt>
On Windows NT, you can attach a file of any type.
On UNIX, you can only attach text files. If you attach a non-text file,
the send may fail.
Note: The filename cannot include the Greater Than character (>)
or a line break.
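As an illustration only (the wording below is an example, not a required format), a completion or failure message might combine several of these variables:

    Session %s completed with status %e.
    Total records loaded: %l, total records rejected: %r.
    Started %b, completed %c, elapsed time %i.
    %g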


The PowerCenter Server on UNIX uses rmail to send post-session email. The repository
user who starts the PowerCenter server must have the rmail tool installed in the path in
order to send email.
To verify the rmail tool is accessible:
1. Login to the UNIX system as the PowerCenter user who starts the PowerCenter
Server.
2. Type rmail <fully qualified email address> at the prompt and press Enter.
3. Type '.' to indicate the end of the message and press Enter.
4. You should receive a blank email from the PowerCenter user's email account. If
not, locate the directory where rmail resides and add that directory to the path.
5. When you have verified that rmail is installed correctly, you are ready to send
post-session email.
The output should look like the following:
Session complete.
Session name: sInstrTest
Total Rows Loaded = 1
Total Rows Rejected = 0
Completed

Rows Loaded  Rows Rejected  ReadThroughput (bytes/sec)  WriteThroughput (rows/sec)  Table Name  Status
1            0              30                          1                           t_Q3_sales  No errors encountered.
Start Time: Tue Sep 14 12:26:31 1999
Completion Time: Tue Sep 14 12:26:41 1999
Elapsed time: 0:00:10 (h:m:s)
This information, or a subset, can also be sent to any text pager that accepts email.
Backup Strategy Recommendation
Q: Can individual objects within a repository be restored from the backup or from a
prior version?
At the present time, individual objects cannot be restored from a backup using the
PowerCenter Repository Manager (i.e., you can only restore the entire repository). But,
it is possible to restore the backup repository into a different database and then
manually copy the individual objects back into the main repository.
Another option is to export individual objects to XML files. This allows for the granular
re-importation of individual objects, mappings, tasks, workflows, etc.

Refer to Migration Procedures for details on promoting new or changed objects between
development, test, QA, and production environments.
Server Administration
Q: What built-in functions does PowerCenter provide to notify someone in the event
that the server goes down, or some other significant event occurs?
The Repository Server can be used to send messages notifying users that the server
will be shut down. Additionally, the Repository Server can be used to send notification
messages about repository objects that are created, modified or deleted by another
user. Notification messages are received through the PowerCenter Client tools.
Q: What system resources should be monitored? What should be considered normal or
acceptable server performance levels?
The pmprocs utility, which is available for UNIX systems only, shows the currently
executing PowerCenter processes.
Pmprocs is a script that combines the ps and ipcs commands. It is available through
Informatica Technical Support. The utility provides the following information:
CPID - Creator PID (process ID)
LPID - Last PID that accessed the resource
Semaphores - used to sync the reader and writer
0 or 1 - shows slot in LM shared memory
(See Chapter 16 in the PowerCenter Repository Guide for additional details.)
A variety of UNIX and Windows NT commands and utilities are also available. Consult
your UNIX and/or Windows NT documentation.
Q: What cleanup (if any) should be performed after a UNIX server crash? Or after an
Oracle instance crash?
If the UNIX server crashes, you should first check to see if the repository database is
able to come back up successfully. If this is the case, then you should try to start the
PowerCenter server. Use the pmserver.err log to check if the server has started
correctly. You can also use ps -ef | grep pmserver to see if the server process (the Load
Manager) is running.
Metadata
Q: What recommendations or considerations exist as to naming standards or repository
administration for metadata that might be extracted from the PowerCenter repository
and used in others?
With PowerCenter, you can enter description information for all repository objects,
sources, targets, transformations, etc, but the amount of metadata that you enter
should be determined by the business requirements. You can also drill down to the
column level and give descriptions of the columns in a table if necessary. All
information about column size and scale, datatypes, and primary keys is stored in the
repository.
The decision on how much metadata to create is often driven by project timelines.
While it may be beneficial for a developer to enter detailed descriptions of each column,
expression, variable, etc, it is also very time consuming to do so. Therefore, this
decision should be made on the basis of how much metadata will be required by the
systems that use the metadata.
There are some time saving tools that are available to better manage a metadata
strategy and content, such as third party metadata software and, for sources and
targets, data modeling tools.
Q: What procedures exist for extracting metadata from the repository?
Informatica offers an extremely rich suite of metadata-driven tools for data
warehousing applications. All of these tools store, retrieve, and manage their metadata
in Informatica's PowerCenter repository. The motivation behind the original Metadata
Exchange (MX) architecture was to provide an effective and easy-to-use interface to the
repository.
Today, Informatica and several key Business Intelligence (BI) vendors, including Brio,
Business Objects, Cognos, and MicroStrategy, are effectively using the MX views to
report and query the Informatica metadata.
Informatica strongly discourages accessing the repository tables directly, even for
SELECT access, because some releases of PowerCenter change the structure of the
repository tables, resulting in a maintenance task for you. Instead, views have been
created to provide access to the metadata stored in the repository.
Additional products, such as Informatica's Metadata Reporter and PowerAnalyzer, allow
for more robust reporting against the repository database and are able to present
reports to the end-user and/or management.
Q: How can I keep multiple copies of the same object within PowerCenter?
A: With PowerCenter 7.x, you can use version control to maintain previous copies of
every changed object.
You can enable version control after you create a repository. Version control allows you
to maintain multiple versions of an object, control development of the object, and track
changes. You can configure a repository for versioning when you create it, or you can
upgrade an existing repository to support versioned objects.
When you enable version control for a repository, the repository assigns all versioned
objects version number 1 and each object has an active status.
You can perform the following tasks when you work with a versioned object:

View object version properties. Each versioned object has a set of version
properties and a status. You can also configure the status of a folder to freeze all
objects it contains or make them active for editing.
Track changes to an object. You can view a history that includes all versions of
a given object, and compare any version of the object in the history to any other
version. This allows you to determine changes made to an object over time.
Check the object version in and out. You can check out an object to reserve it
while you edit the object. When you check in an object, the repository saves a
new version of the object and allows you to add comments to the version. You
can also find objects checked out by yourself and other users.
Delete or purge the object version. You can delete an object from view and
continue to store it in the repository. You can recover, or undelete, deleted
objects. If you want to permanently remove an object version, you can purge it
from the repository.
Q: Is there a way to migrate only the changed objects from Development to Production
without having to spend too much time on making a list of all changed/affected
objects?
A: Yes, there is.
You can create deployment groups that allow you to group versioned objects for
migration to a different repository.
You can create the following types of deployment groups:
Static. You populate the deployment group by manually selecting objects.
Dynamic. You use the result set from an object query to populate the deployment
group.
To make a smooth transition/migration to Production, you need to have a query
associated with your Dynamic deployment group. When you associate an object query
with the deployment group, the Repository Agent runs the query at the time of
deployment. You can associate an object query with a deployment group when you edit
or create a deployment group.
If the repository is enabled for versioning, you may also copy the objects in a
deployment group from one repository to another. Copying a deployment group allows
you to copy objects in a single copy operation across multiple folders in the source
repository into multiple folders in the target repository. Copying a deployment group
also allows you to specify individual objects to copy, rather than the entire contents of a
folder.
Q: Can I load balance PowerCenter sessions?
A: Yes. PowerCenter version 7 allows you to set up a server grid.
When you create a server grid, you can add PowerCenter Servers to the grid. When you
run a workflow against a PowerCenter Server in the grid, that server becomes the
master server for the workflow. The master server runs all non-session tasks and
assigns session tasks to run on other servers that are defined in the grid. The other
servers become worker servers for that workflow run.
You can add servers to a server grid at any time. When a server starts up, it connects
to the grid and can run sessions from master servers and distribute sessions to worker
servers in the grid. The Workflow Monitor communicates with the master server to
monitor progress of workflows, get session statistics, retrieve performance details, and
stop or abort the workflow or task instances.
If a PowerCenter Server loses its connection to the grid, it tries to re-establish a
connection. You do not need to restart the server for it to connect to the grid. If a
PowerCenter Server is not connected to the server grid, the other PowerCenter Servers
in the server grid do not send it tasks.
Q: How does Web Services Hub work in version 7 of PowerCenter?
A: The Web Services Hub is a PowerCenter Service gateway for external clients. It
exposes PowerCenter functionality through a service-oriented architecture. It receives
requests from web service clients and passes them to the PowerCenter Server or the
Repository Server. The PowerCenter Server or Repository Server processes the
requests and send a response to the web service client through the Web Services Hub.
The Web Services Hub hosts Batch Web Services, Metadata Web Services, and Real -
time Web Services.
Install the Web Services Hub on an application server and configure information such as
repository login, session expiry and log buffer sizes.
The Web Services Hub connects to the Repository Server and the PowerCenter Server
through TCP/IP. Web service clients log in to the Web Services Hub through HTTP(s).
The Web Services Hub authenticates the client based on repository user name and
password. You can use the Web Services Hub console to view service information and
download Web Services Description Language (WSDL) files necessary for running
services and workflows.


Key Management in Data Warehousing Solutions
Challenge
Key management refers to the technique that manages key allocation in a decision
support RDBMS to create a single view of reference data from multiple sources.
Informatica recommends a concept of key management that ensures loading
everything extracted from a source system into the data warehouse.
This Best Practice provides some tips for employing the Informatica-recommended
approach to key management, an approach that deviates from many traditional data
warehouse solutions, which apply logical and data warehouse (surrogate) key strategies
in which transactions with referential integrity issues are rejected and logged as errors.
Description
Key management in a decision support RDBMS comprises three techniques for handling
the following common situations:
Key merging/matching
Missing keys
Unknown keys
All three methods are applicable to a Reference Data Store, whereas only the missing
and unknown keys are relevant for an Operational Data Store (ODS). Key management
should be handled at the data integration level, thereby making it transparent to the
Business Intelligence layer.
Key Merging/Matching
When companies source data from more than one transaction system of a similar type,
the same object may have different, non-unique legacy keys. Additionally, a single key
may have several descriptions or attributes in each of the source systems. The
independence of these systems can result in incongruent coding, which poses a greater
problem than records being sourced from multiple systems.
A business can resolve this inconsistency by undertaking a complete code
standardization initiative (often as part of a larger metadata management effort) or
applying a Universal Reference Data Store (URDS). Standardizing code requires an
object to be uniquely represented in the new system. Alternatively, URDS contains
universal codes for common reference values. Most companies adopt this pragmatic
approach, while embarking on the longer term solution of code standardization.
The bottom line is that nearly every data warehouse project encounters this issue and
needs to find a solution in the short term.
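Conceptually, a URDS is a cross-reference from each source system's legacy code to a single universal code. The short Python sketch below is only a hypothetical illustration of that idea (the system names, codes, and function name are invented for the example); in practice the cross-reference would live in a reference table maintained by the data integration processes, not in application code.

    # Hypothetical universal reference cross-reference (URDS) sketch.
    # Maps (source system, legacy code) pairs to one universal code.
    URDS_XREF = {
        ("BILLING", "CUST-0042"): "C000100",
        ("CRM",     "42A"):       "C000100",   # same customer, different legacy keys
        ("CRM",     "57B"):       "C000205",
    }

    def to_universal_code(system, legacy_code):
        """Return the universal code for a legacy key, or None if unmapped."""
        return URDS_XREF.get((system, legacy_code))

    # Two different legacy keys resolve to one view of the same customer.
    assert to_universal_code("BILLING", "CUST-0042") == to_universal_code("CRM", "42A")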
Missing Keys
A problem arises when a transaction is sent through without a value in a column where
a foreign key should exist (i.e., a reference to a key in a reference table). This normally
occurs during the loading of transactional data, although it can also occur when loading
reference data into hierarchy structures. In many older data warehouse solutions, this
condition would be identified as an error and the transaction row would be rejected.
The row would have to be processed through some other mechanism to find the correct
code and loaded at a later date. This is often a slow and cumbersome process that
leaves the data warehouse incomplete until the issue is resolved.
The more practical way to resolve this situation is to allocate a special key in place of
the missing key, which links it with a dummy 'missing key' row in the related table. This
enables the transaction to continue through the loading process and end up in the
warehouse without further processing. Furthermore, the row ID of the bad transaction
can be recorded in an error log, allowing the addition of the correct key value at a later
time.
The major advantage of this approach is that any aggregate values derived from the
transaction table will be correct because the transaction exists in the data warehouse
rather than being in some external error processing file waiting to be fixed.
Simple Example:
PRODUCT    CUSTOMER   SALES REP  QUANTITY  UNIT PRICE
Audi TT18  Doe10224              1         35,000
In the transaction above, there is no code in the SALES REP column. As this row is
processed, a dummy sales rep key (9999999) is added to the record to link it to the
'Missing Rep' record in the SALES REP table. A data warehouse key (8888888) is also
added to the transaction.
PRODUCT    CUSTOMER   SALES REP  QUANTITY  UNIT PRICE  DWKEY
Audi TT18  Doe10224   9999999    1         35,000      8888888
The related sales rep record may look like this:
REP CODE  REP NAME     REP MANAGER
1234567   David Jones  Mark Smith
7654321   Mark Smith
9999999   Missing Rep
An error log entry to identify the missing key on this transaction may look like:
ERROR CODE  TABLE NAME  KEY NAME   KEY
MSGKEY      ORDERS      SALES REP  8888888
This type of error reporting is not usually necessary because the transactions with
missing keys can be identified using standard end-user reporting tools against the data
warehouse.
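The missing-key technique can be summarized as: substitute the dummy key, let the row load, and record the row for later correction. The sketch below is a simplified, hypothetical illustration of that logic in Python (the constants and function name are invented for the example); in a PowerCenter mapping, the same decision would typically be made in an Expression or Lookup transformation.

    MISSING_KEY = "9999999"   # dummy key linked to the 'Missing Rep' row in the reference table

    def resolve_missing_key(transaction, key_column, error_log, dw_key):
        """If the foreign key is absent, substitute the dummy key and log the row."""
        if not transaction.get(key_column):           # no value where a key should be
            transaction[key_column] = MISSING_KEY     # link to the dummy reference row
            error_log.append({"ERROR CODE": "MSGKEY",
                              "TABLE NAME": "ORDERS",
                              "KEY NAME": key_column,
                              "KEY": dw_key})         # allows the key to be fixed later
        return transaction

    # Example: the transaction loads instead of being rejected.
    log = []
    row = {"PRODUCT": "Audi TT18", "CUSTOMER": "Doe10224", "SALES REP": None}
    resolve_missing_key(row, "SALES REP", log, dw_key="8888888")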
Unknown Keys
Unknown keys need to be treated much like missing keys except that the load process
has to add the unknown key value to the referenced table to maintain integrity rather
than explicitly allocating a dummy key to the transaction. The process also needs to
make two error log entries: the first logs the fact that a new, unknown key has been
added to the reference table, and the second records the transaction in which the
unknown key was found.
Simple example:
The sales rep reference data record might look like the following:
DWKEY    REP NAME     REP MANAGER
1234567  David Jones  Mark Smith
7654321  Mark Smith
9999999  Missing Rep
A transaction comes into ODS with the record below:
PRODUCT    CUSTOMER   SALES REP  QUANTITY  UNIT PRICE
Audi TT18  Doe10224   2424242    1         35,000
In the transaction above, the code 2424242 appears in the SALES REP column. As this
row is processed, a new row has to be added to the Sales Rep reference table. This
allows the transaction to be loaded successfully.
DWKEY    REP NAME  REP MANAGER
2424242  Unknown
A data warehouse key (8888889) is also added to the transaction.
PRODUCT    CUSTOMER   SALES REP  QUANTITY  UNIT PRICE  DWKEY
Audi TT18  Doe10224   2424242    1         35,000      8888889
Some warehouse administrators like to have an error log entry generated to identify
the addition of a new reference table entry. This can be achieved simply by adding the
following entries to an error log.
ERROR CODE  TABLE NAME  KEY NAME   KEY
NEWROW      SALES REP   SALES REP  2424242
A second log entry can be added with the data warehouse key of the transaction in
which the unknown key was found.
ERROR CODE  TABLE NAME  KEY NAME   KEY
UNKNKEY     ORDERS      SALES REP  8888889
As with missing keys, error reporting is not essential because the unknown status is
clearly visible through the standard end-user reporting.
Moreover, regardless of the error logging, the system is self-healing because the newly
added reference data entry will be updated with full details as soon as these changes
appear in a reference data feed.
This would result in the reference data entry looking complete.
DWKEY    REP NAME     REP MANAGER
2424242  David Digby  Mark Smith
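Unknown-key handling differs from missing-key handling in that the load process inserts a placeholder row into the reference table (keyed on the incoming value) and writes two error-log entries. The sketch below is again only an illustration with invented function names; the reference table would normally be a relational table maintained through a dynamic lookup or a separate load path rather than an in-memory structure.

    def resolve_unknown_key(transaction, key_column, reference_table, ref_table_name,
                            error_log, dw_key):
        """Insert a placeholder reference row for an unseen key and log both events."""
        code = transaction[key_column]
        if code not in reference_table:
            reference_table[code] = {"REP NAME": "Unknown", "REP MANAGER": None}
            error_log.append({"ERROR CODE": "NEWROW", "TABLE NAME": ref_table_name,
                              "KEY NAME": key_column, "KEY": code})
            error_log.append({"ERROR CODE": "UNKNKEY", "TABLE NAME": "ORDERS",
                              "KEY NAME": key_column, "KEY": dw_key})
        return transaction

    # Example: key 2424242 is added as a placeholder and the transaction still loads;
    # a later reference data feed overwrites the placeholder with full details.
    sales_reps = {"1234567": {"REP NAME": "David Jones", "REP MANAGER": "Mark Smith"}}
    log = []
    row = {"PRODUCT": "Audi TT18", "CUSTOMER": "Doe10224", "SALES REP": "2424242"}
    resolve_unknown_key(row, "SALES REP", sales_reps, "SALES REP", log, dw_key="8888889")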
Employing the Informatica recommended key management strategy produces the
following benefits:
All rows can be loaded into the data warehouse
All objects are allocated a unique key
Referential integrity is maintained
Load dependencies are removed


Mapping Design
Challenge
Optimizing PowerCenter to create an efficient execution environment.
Description
Although PowerCenter environments vary widely, most sessions and/or mappings can
benefit from the implementation of common objects and optimization procedures.
Follow these procedures and rules of thumb when creating mappings to help ensure
optimization.
General Suggestions for Optimizing
1. Reduce the number of transformations. There is always overhead involved in
moving data between transformations.
2. Consider more shared memory for a large number of transformations. Session
shared memory between 12MB and 40MB should suffice.
3. Calculate once, use many times.
o Avoid calculating or testing the same value over and over.
o Calculate it once in an expression, and set a True/False flag.
o Within an expression, use variable ports to calculate a value that can be
used multiple times within that transformation.
4. Only connect what is used.
o Delete unnecessary links between transformations to minimize the amount
of data moved, particularly in the Source Qualifier.
o This is also helpful for maintenance. If a transformation needs to be
reconnected, it is best to only have necessary ports set as input and
output to reconnect.
o In lookup transformations, change unused ports to be neither input nor
output. This makes the transformations cleaner looking. It also makes
the generated SQL override as small as possible, which cuts down on the
amount of cache necessary and thereby improves performance.
5. Watch the data types.
o The engine automatically converts compatible types.
o Sometimes data conversion is excessive. Data types are automatically
converted when types are different between connected ports. Minimize
data type changes between transformations by planning data flow prior
to developing the mapping.
6. Facilitate reuse.

o Plan for reusable transformations upfront.
o Use variables. Use both mapping variables and ports that are
variables. Variable ports are especially beneficial when they can be used
to calculate a complex expression or perform a disconnected lookup call
only once instead of multiple times.
o Use mapplets to encapsulate multiple reusable transformations.
o Use mapplets to leverage the work of critical developers and minimize
mistakes when performing similar functions.
7. Only manipulate data that needs to be moved and transformed.
o Reduce the number of non-essential records that are passed through the
entire mapping.
o Use active transformations that reduce the number of records as early in
the mapping as possible (i.e., placing filters, aggregators as close to
source as possible).
o Select appropriate driving/master table while using joins. The table with the
lesser number of rows should be the driving/master table for a faster
join.
8. Utilize single-pass reads.
o Redesign mappings to utilize one Source Qualifier to populate multiple
targets. This way the server reads this source only once. If you have
different Source Qualifiers for the same source (e.g., one for delete and
one for update/insert), the server reads the source for each Source
Qualifier.
o Remove or reduce field-level stored procedures.
o If you use field-level stored procedures, the PowerCenter server has to
make a call to that stored procedure for every row, slowing performance.
Lookup Transformation Optimizing Tips
1. When your source is large, cache lookup table columns for those lookup tables
of 500,000 rows or less. This typically improves performance by 10 to 20
percent.
2. The rule of thumb is not to cache any table over 500,000 rows, but this assumes
a standard row byte count of 1,024 or less. If the row byte count is more than
1,024, the 500K-row threshold must be adjusted down as the number of bytes
increases (e.g., a 2,048-byte row can drop the cache row count to between
250K and 300K, so the lookup table should not be cached in that case). This is
only a general rule; try running the session with a large lookup both cached and
uncached, since caching is often still faster on very large lookup tables. A sketch
combining these caching heuristics appears after this list.
3. When using a Lookup Table Transformation, improve lookup performance by
placing all conditions that use the equality operator = first in the list of
conditions under the condition tab.
4. Only cache lookup tables if the number of lookup calls is more than 10 to 20
percent of the lookup table rows. For a smaller number of lookup calls, do not
cache if the number of lookup table rows is large. For small lookup tables (i.e.,
fewer than 5,000 rows), cache if there are more than 5 to 10 lookup calls.
5. Replace lookup with decode or IIF (for small sets of values).
6. If caching lookups and performance is poor, consider replacing with an
unconnected, uncached lookup.
7. For overly large lookup tables, use dynamic caching along with a persistent
cache. Cache the entire table to a persistent file on the first run, enable the
update else insert option on the dynamic cache and the engine will never have
to go back to the database to read data from this table. You can also partition
this persistent cache at run time for further performance gains.
8. Review complex expressions.
Examine mappings via Repository Reporting and Dependency Reporting within the
mapping.
Minimize aggregate function calls.
Replace Aggregate Transformation object with an Expression Transformation
object and an Update Strategy Transformation for certain types of Aggregations.
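The caching rules of thumb in items 1, 2, and 4 of the list above can be combined into a simple decision helper. The Python sketch below only encodes those heuristics as stated (the thresholds are the guideline figures from this list, not hard limits, and the function name is invented for the example); the final answer should always be confirmed by timing the session both cached and uncached.

    def should_cache_lookup(table_rows, row_bytes, lookup_calls):
        """Heuristic only: encodes the caching rules of thumb from this Best Practice."""
        # Rule 2: the 500K-row ceiling assumes ~1,024-byte rows; scale it down
        # proportionally for wider rows (a 2,048-byte row roughly halves it).
        max_cache_rows = 500_000 * min(1.0, 1024.0 / max(row_bytes, 1))

        if table_rows <= 5_000:                   # Rule 4: small lookup tables
            return lookup_calls > 10              # cache for more than ~5-10 calls
        if table_rows > max_cache_rows:           # Rule 2: too wide/large to cache
            return False
        # Rule 4: cache only if calls exceed ~10-20% of the lookup table rows.
        return lookup_calls > 0.10 * table_rows

    # Example: a 400,000-row lookup with 1,000-byte rows and 100,000 probes -> cache it.
    print(should_cache_lookup(400_000, 1_000, 100_000))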

Operations and Expression Optimizing Tips
1. Numeric operations are faster than string operations.
2. Optimize char-varchar comparisons (i.e., trim spaces before comparing).
3. Operators are faster than functions (i.e., || vs. CONCAT).
4. Optimize IIF expressions.
5. Avoid date comparisons in lookup; replace with string.
6. Test expression timing by replacing with constant.
7. Use flat files.
Flat files located on the server machine load faster than a database located on
the server machine.
Fixed-width files are faster to load than delimited files because delimited files
require extra parsing.
If processing intricate transformations, consider first loading the source flat file
into a relational database, which allows the PowerCenter mappings to access the
data in an optimized fashion by using filters and custom SQL SELECTs where
appropriate.
8. If working with sources that cannot return sorted data (e.g., web logs),
consider using the Sorter Advanced External Procedure.
9. Use a Router Transformation to separate data flows instead of multiple Filter
Transformations.
10. Use a Sorter Transformation or hash-auto keys partitioning before an
Aggregator Transformation to optimize the aggregate. With a Sorter
Transformation, the Sorted Ports option can be used, even if the original source
cannot be ordered.
11. Use a Normalizer Transformation to pivot rows rather than multiple instances of
the same target.
12. Rejected rows from an update strategy are logged to the bad file. Consider
filtering before the update strategy if retaining these rows is not critical because
logging causes extra overhead on the engine. Choose the option in the update
strategy to discard rejected rows.
13. When using a Joiner Transformation, be sure to make the source with the
smallest amount of data the Master source.
14. If an update override is necessary in a load, consider using a Lookup
transformation just in front of the target to retrieve the primary key. The
primary key update will be much faster than the non-indexed lookup override.
Suggestions for Using Mapplets

A mapplet is a reusable object that represents a set of transformations. It allows you to
reuse transformation logic and can contain as many transformations as necessary. Use
the Mapplet Designer to create mapplets.
1. Create a mapplet when you want to use a standardized set of transformation
logic in several mappings. For example, if you have several fact tables that
require a series of dimension keys, you can create a mapplet containing a series
of Lookup transformations to find each dimension key. You can then use the
mapplet in each fact table mapping, rather than recreate the same lookup logic
in each mapping.
2. To create a mapplet, add, connect, and configure transformations to complete
the desired transformation logic. After you save a mapplet, you can use it in a
mapping to represent the transformations within the mapplet. When you use a
mapplet in a mapping, you use an instance of the mapplet. All uses of a mapplet
are tied to the parent mapplet. Hence, all changes made to the parent mapplet
logic are inherited by every child instance of the mapplet. When the server runs
a session using a mapplet, it expands the mapplet. The server then runs the
session as it would any other session, passing data through each transformation
in the mapplet as designed.
3. A mapplet can be active or passive depending on the transformations in the
mapplet. Active mapplets contain at least one active transformation. Passive
mapplets only contain passive transformations. Being aware of this property
when using mapplets can save time when debugging invalid mappings.
4. Unsupported transformations that should not be used in a mapplet include:
COBOL source definitions, normalizer, non-reusable sequence generator, pre- or
post-session stored procedures, target definitions, and PowerMart 3.5-style
lookup functions.
5. Do not reuse a mapplet if you only need one or two of its transformations
while all of its other calculated ports and transformations would go unused.
6. Source data for a mapplet can originate from one of two places:
Sources within the mapplet. Use one or more source definitions connected to a
Source Qualifier or ERP Source Qualifier transformation. When you use the
mapplet in a mapping, the mapplet provides source data for the mapping and is
the first object in the mapping data flow.
Sources outside the mapplet. Use a mapplet Input transformation to define
input ports. When you use the mapplet in a mapping, data passes through the
mapplet as part of the mapping data flow.
7. To pass data out of a mapplet, create mapplet output ports. Each port in an
Output transformation connected to another transformation in the mapplet
becomes a mapplet output port.
Active mapplets with more than one Output transformation. You need one
target in the mapping for each Output transformation in the mapplet. You
cannot use only one data flow of the mapplet in a mapping.
Passive mapplets with more than one Output transformation. Reduce to
one Output transformation; otherwise you need one target in the mapping for
each Output transformation in the mapplet. This means you cannot use only one
data flow of the mapplet in a mapping.


Mapping Templates
Challenge
Mapping Templates demonstrate proven solutions for tackling challenges that
commonly occur during data integration development efforts. They can be used to make
the development phase of a project more efficient, and they can serve as a medium for
introducing development standards that developers need to follow into the mapping
development process.
A wide array of Mapping Template examples can be obtained for the most current
PowerCenter version from the Informatica Customer Portal. As "templates," each of the
objects in Informatica's Mapping Template Inventory illustrates the transformation logic
and steps required to solve specific data integration requirements. These sample
templates, however, are meant to be used as examples, not as means to implement
development standards.
Description

Reuse Transformation Logic
Templates can be heavily used in a data integration and warehouse environment, when
loading information from multiple source providers into the same target structure, or
when similar source system structures are employed to load different target instances.
Using templates ensures that transformation logic that has been developed and tested
correctly once can be applied across multiple mappings as needed. In some instances,
if the source/target structures have the same attributes, the process can be further
simplified by creating multiple instances of the session, each with its own
connection/execution attributes, instead of duplicating the mapping.
Implementing Development Techniques
When the process is not as simple as duplicating transformation logic to load the same
target, Mapping Templates can still help reproduce transformation techniques. In this
case, the implementation process requires more than just replacing source/target
transformations. This scenario is most useful when certain logic (i.e., a logical group of
transformations) is employed across mappings. In many instances this can be further
simplified by making use of mapplets.

Transport mechanism
Once Mapping Templates have been developed, they can be distributed by any of the
following procedures:
Copy mapping from development area to the desired repository/folder
Export mapping template into XML and import to the desired repository/folder.
Mapping template examples
The following Mapping Templates can be downloaded from the Informatica Customer
Portal and are listed by subject area:
Common Data Warehousing Techniques
Aggregation using Sorted Input
Tracking Dimension History
Constraint-Based Loading
Loading Incremental Updates
Tracking History and Current
Inserts or Updates
Transformation Techniques
Error Handling Strategy
Flat File Creation with Headers and Footers
Removing Duplicate Source Records
Transforming One Record into Multiple Records
Dynamic Caching
Sequence Generator Alternative
Streamline a Mapping with a Mapplet
Reusable Transformations (Customers)
Using a Sorter
Pipeline Partitioning Mapping Template
Using Update Strategy to Delete Rows
Loading Heterogeneous Targets
Load Using External Procedure
Advanced Mapping Concepts
Aggregation Using Expression Transformation
Building a Parameter File
Best Build Logic
Comparing Values Between Records
Source-Specific Requirements
Processing VSAM Source Files
Processing Data from an XML Source
Joining a Flat File with a Relational Table

Industry-Specific Requirements
Loading SWIFT 942 Messages
Loading SWIFT 950 Messages


Naming Conventions
Challenge
Choosing a good naming standard for use in the repository and adhering to it.
Description
Although naming conventions are important for all repository and database objects, the
suggestions in this Best Practice focus on the former. Choosing a convention and
sticking with it is the key.
Having a good naming convention facilitates smooth migration and improves readability
for anyone reviewing or maintaining the repository objects, by helping them easily
understand the processes being affected. If consistent names and descriptions are not
used, more time is needed to understand the workings of mappings and transformation
objects. If there is no description, a developer has to spend considerable time going
through an object or mapping to understand its objective.
The following pages offer some suggestions for naming conventions for various
repository objects. Whatever convention is chosen, it is important to adopt it very early
in the development cycle and communicate it to project staff working on the repository.
The policy can be enforced by peer review and at test phases by adding checks for the
conventions to test plans and test execution documents.
Suggested Naming Conventions
Transformation Objects Suggested Naming Conventions
Application Source Qualifier ASQ_TransformationName _SourceTable1_SourceTable2.
Represents data from application source.
Expression Transformation: EXP_{function} that leverages the expression
and/or a name that describes the processing being done.
Custom CT_{TransformationName} that describes the processing being
done.
Sequence Generator
Transformation
SEQ_{Descriptor}; if generating keys for a target table
entity, refer to that entity.
Lookup Transformation: LKP_ Use the lookup table name or the item being
obtained by the lookup, since there can be different lookups on a single
table.
Source Qualifier
Transformation:
SQ_{SourceTable1}_{SourceTable2}. Using all source
tables can be impractical if there are a lot of tables in a
source qualifier, so refer to the type of information being
obtained, for example a certain type of product
SQ_SALES_INSURANCE_PRODUCTS.
Aggregator Transformation: AGG_{function} that leverages the expression or a name
that describes the processing being done.
Filter Transformation: FIL_ or FILT_ {function } that leverages the expression or a
name that describes the processing being done.
Update Strategy
Transformation:
UPD_{TargetTableName(s)} that leverages the expression
or a name that describes the processing being done. If the
update strategy carries out inserts or updates only, then add
insert/ins or update/upd to the name, e.g.,
UPD_UPDATE_EXISTING_EMPLOYEES
MQ Source Qualifier SQ_MQ_Descriptor defines the messaging being selected.
Normalizer Transformation: NRM_{TargetTableName(s)} that leverages the expression
or a name that describes the processing being done.
Union UN_Descriptor
Router Transformation RTR_{Descriptor}
XML Generator XMG_Descriptor defines the target message.
XML Parser XMP_Descriptor defines the messaging being selected.
XML Source Qualifier XMSQ_Descriptor defines the data being selected.
Rank Transformation: RNK_{TargetTableName(s)} that leverages the expression
or a name that describes the processing being done.
Stored Procedure
Transformation:
SP_{StoredProcedureName}
External Procedure
Transformation:
EXT_{ProcedureName}
Joiner Transformation: JNR_{SourceTable/FileName1}_ {SourceTable/FileName2}
or use more general descriptions for the content in the data
flows as joiners are not only used to provide pure joins
between heterogeneous source tables and files.
Target TGT_Target_Name
Mapplet Transformation: mplt_{description}
Mapping Name: m_{target}_{descriptor}
Email Object email_{Descriptor}
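Naming standards such as these are easiest to enforce when the review step is partly automated. The Python sketch below is a minimal, hypothetical checker (the object list would in practice come from an XML export or an MX view query, neither of which is shown here) that flags names lacking the prefixes suggested in the table above; the rule set covers only an illustrative subset of object types.

    import re

    # Expected prefixes for a few object types from the table above (illustrative subset).
    NAMING_RULES = {
        "Expression":  r"^EXP_",
        "Lookup":      r"^LKP_",
        "Aggregator":  r"^AGG_",
        "Filter":      r"^(FIL|FILT)_",
        "Mapping":     r"^m_",
        "Session":     r"^s_",
    }

    def check_names(objects):
        """Return (object_type, name) pairs that violate the suggested conventions."""
        return [(otype, name) for otype, name in objects
                if otype in NAMING_RULES and not re.match(NAMING_RULES[otype], name)]

    # Example input (hypothetical repository objects).
    print(check_names([("Expression", "EXP_CALC_TAX"),
                       ("Lookup", "CustomerLookup"),      # flagged: missing LKP_ prefix
                       ("Session", "s_m_LOAD_CUSTOMERS")]))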
Port Names
Ports names should remain the same as the source unless some other action is
performed on the port. In that case, the port should be prefixed with the appropriate
name.
When the developer brings a source port into a lookup or expression, the port should
be prefixed with IN_. This helps the user immediately identify input ports without
having to line them up with the input checkbox.

Generated output ports can also be prefixed. This helps trace the port value throughout
the mapping as it may travel through many other transformations. If it is intended to
be able to use the autolink feature based on names, then outputs may be better left as
the name of the target port in the next transformation. For variables inside a
transformation, the developer could use the prefix 'v', 'v_', or 'var_' plus a meaningful
name.
The following port standards will be applied when creating a transformation object. The
exceptions are the Source Definition, the Source Qualifier, the Lookup, and the Target
Definition ports, which must not change since the port names are used to retrieve data
from the database.
Other transformations that are not applicable to the port standards are:
Normalizer: The ports created in the Normalizer are automatically formatted when
the developer configures it.
Sequence Generator: The ports are reserved words.
Router: The output ports are automatically created; therefore prefixing the input
ports with an I_ will prefix the output ports with I_ as well. The port names
should not have any prefix.
Sorter, Update Strategy, Transaction Control, and Filter: The ports are always
input and output. There is no need to rename them unless they are prefixed.
Prefixed port names should be removed.
Union: The group ports are automatically assigned to the input and output;
therefore prefixing with anything is reflected in both the input and output. The
port names should not have any prefix.
All other transformation object ports can be prefixed or suffixed with:
in_ or i_ for Input ports
o_ or _out for Output ports
io_ for Input/Output ports
v, v_, or var_ for variable ports
They can also:
Have the Source Qualifier port name.
Be unique.
Be meaningful.
Be given the target port name.

Transformation Descriptions
This section defines the standards to be used for transformation descriptions in the
Designer.
Source qualifier description
The description should include the aim of the source qualifier and the data it is intended
to select. It should also indicate if any SQL overrides are used. If so, it should describe
the filters used. Some projects prefer the SQL statement to be included in the
description as well.
Lookup transformation description
Describe the lookup along the lines of the [lookup attribute] obtained from [lookup
table name] to retrieve the [lookup attribute name].
Where:
Lookup attribute is the name of the column being passed into the lookup and is
used as the lookup criteria.
Lookup table name is the table on which the lookup is being performed.
Lookup attribute name is the name of the attribute being returned from the
lookup. If appropriate, specify the condition when the lookup is actually
executed.
It is also important to note lookup features such as persistent cache or dynamic lookup.
Expression transformation description
Each Expression transformation description must be in the format:
This expression [explanation of what transformation does].
Expressions can be distinctly different depending on the situation; therefore the
explanation should be specific to the actions being performed.
Within each Expression, transformation ports have their own description in the format:
This port [explanation of what the port is used for].
Aggregator transformation descriptions
Each Aggregator transformation description must be in the format:
This Aggregator [explanation of what transformation does].
Aggregators can be distinctly different, depending on the situation; therefore the
explanation should be specific to the actions being performed.
Within each Aggregator, transformation ports have their own description in the format:
This port [explanation of what the port is used for].
Sequence generators transformation descriptions
Each Sequence Generator transformation description must be in the format:

This Sequence Generator provides the next value for the [column name] on the [table
name].
Where:
table name is the table being populated by the sequence number and the
column name is the column within that table being populated.
Joiner transformation descriptions
Each Joiner transformation description must be in the format:
This Joiner uses [joining field names] from [joining table names].
Where:
joining field names are the names of the columns on which the join is done and
the
joining table names are the tables being joined.
Normalizer transformation descriptions
Each Normalizer transformation description must be in the format:
This Normalizer [explanation].
Where explanation is an explanation of what the Normalizer does.
Filter transformation description
Each Filter transformation description must be in the format:
This Filter processes [explanation].
Where explanation is an explanation of what the filter criteria are and what they do.
Stored procedure transformation descriptions
An explanation of the stored procedure's functionality within the mapping. What does it
return in relation to the input ports?
Input transformation descriptions
Describe the input values and their intended use in the mapplet.
Output transformation descriptions
Describe the output ports and the subsequent use of those values. As an example, for
an exchange rate mapplet, describe what currency the output value will be in. Answer
questions like: Is the currency fixed or based on other data? What kind of rate is used?
Is it a fixed inter-company rate, an interbank rate, a business rate, or a tourist rate?
Has the conversion gone through an intermediate currency?
Update strategies transformation description
Describe what the Update Strategy does and whether it is fixed in its function or
determined by a calculation.
Sorter transformation description
An explanation of the port(s) being sorted and their sort direction.
Router transformation description
An explanation that describes the groups and their function.
Union transformation description
Describe the source inputs and indicate what further processing on those inputs (if any)
is expected to take place in later transformations in the mapping.
Transaction control transformation description
Describe the process behind the transaction control and the function of the control to
commit or rollback.
Mapping Comments
Describe the source data obtained and the structure (file, table, or facts and
dimensions) that it populates. Remember to use business terms along with more
technical details such as table names. This helps when maintenance has to be carried
out or if issues arise that need to be discussed with business analysts.
Mapplet Comments
An explanation of the process that the mapplet carries out. Also see notes for the
description for the input and output transformation.
Shared Objects
Any object within a folder can be shared. These objects are sources, targets, mappings,
transformations, and mapplets. To share objects in a folder, the folder must be
designated as shared. Once the folder is shared, users are allowed to create shortcuts
to objects in the folder.
If the developer has an object that he or she wants to use in several mappings or
across multiple folders, like an Expression transformation that calculates sales tax, the
developer can place the object in a shared folder and then use it in other folders by
creating a shortcut to the object. In this case, the naming convention is SC_, for
instance SC_mltCREATION_SESSION or SC_DUAL.
Shared Folders
Shared folders are used when objects are needed across folders but the developer
wants to maintain them in only one central location. In addition to ease of
maintenance, shared folders help reduce the size of the repository since shortcuts are
used to link to the original, instead of copies.
Only users with the proper permissions can access these shared folders. It is the
responsibility of these users to migrate the folders across the repositories and to
maintain the objects within those folders with the help of the developers. For instance,
if an object is created by a developer and it is to be shared, the developer will provide
details of the object and the level at which the object is to be shared before the
Administrator will accept it as a valid entry into the shared folder. The developers, not
necessarily the creator, control the maintenance of the object, as they will need to
ensure that a change they require will not negatively impact other objects.
Workflow Manager Objects
WorkFlow Objects Suggested Naming Convention
Session Name: s_{MappingName}
Command Object cmd_{Descriptor}
WorkLet Name Wk or Wklt_{Descriptor}
Workflow Names: Wkf or wf_{Workflow Descriptor}
Email Task: Email_ or eml_{Email Descriptor}
Decision Task: dcn_{Condition_Descriptor}
Assign Task: asgn_{Variable_Descriptor }
Timer Task: Timer_ or tim_{Descriptor}
Control Task: ctl_{WorkFlow_Descriptor} Use the Control task to specify
when and how the PowerCenter Server stops or aborts a
workflow.
Event Wait Task: Wait_ or evtw_{Event_Descriptor} The Event-Wait task
waits for an event to occur. Once the event triggers, the
PowerCenter Server continues executing the rest of the
workflow.
Event Raise Task: Raise_ or evtr_ {Event_Descriptor} Event-Raise task
represents a user-defined event. When the PowerCenter
Server runs the Event-Raise task, the Event-Raise task
triggers the event. Use the Event-Raise task with the Event-
Wait task to define events.
ODBC Data Source Names
Be sure to set up all Open Database Connectivity (ODBC) data source names (DSNs)
the same way on all client machines. PowerCenter uniquely identifies a source by its
Database Data Source (DBDS) and its name. The DBDS is the same name as the ODBC
DSN since the PowerCenter Client talks to all databases through ODBC.

INFORMATICA CONFIDENTIAL BEST PRACTICES PAGE BP-119
Also set up the ODBC DSNs as system DSNs so that all users of a machine can see the DSN. This reduces the chance of discrepancies creeping in when users work on different (i.e., colleagues') machines and have to recreate the DSN there.
If ODBC DSNs are different across multiple machines, there is a risk of analyzing the same table under different names. For example, machine 1 has ODBC DSN Name0 that points to database1. TableA is analyzed on machine 1 and is uniquely identified as Name0.TableA in the repository. Machine 2 has ODBC DSN Name1 that points to database1. TableA is analyzed on machine 2 and is uniquely identified as Name1.TableA in the repository. The result is that the repository may refer to the same object by multiple names, creating confusion for developers, testers, and potentially end users.
Also, refrain from using environment tokens in the ODBC DSN. For example, do not call
it dev_db01. When migrating objects from dev, to test, to prod, PowerCenter will wind
up with source objects called dev_db01 in the production repository. ODBC database
names should clearly describe the database they reference to ensure that users do not
incorrectly point sessions to the wrong databases.
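For illustration, the hypothetical DSN names below contrast an environment-neutral name with names that embed environment or machine tokens (the database and server names are invented for this example):

Recommended:  SALES_DW       (describes the database it references)
Avoid:        dev_db01       (embeds the environment name)
Avoid:        srv042_sales   (embeds the machine name)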
Database Connection Information
Security considerations may dictate that the company name of the database or project
be used instead of {user}_{database name} except for developer scratch schemas that
are not found in test or production environments. Be careful not to include machine
names or environment tokens in the database connection name. Database connection
names must be very generic to be understandable and ensure a smooth migration.
The convention should be applied across all development, test, and production
environments. This allows seamless migration of sessions when migrating between
environments. If an administrator uses the Copy Folder function for migration, session
information is also copied. If the Database Connection information does not already
exist in the folder the administrator is copying to, it is also copied. So, if the developer
uses connections with names like Dev_DW in the development repository, they will
eventually wind up in the test and even in the production repositories as the folders are
migrated. Manual intervention is then necessary to change connection names, user
names, passwords, and possibly even connect strings.
Instead, if the developer just has a DW connection in each of the three environments, when the administrator copies a folder from the development environment to the test environment, the sessions automatically use the existing connection in the test repository. With the right naming convention, you can migrate sessions to the test repository without manual intervention.
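As a hedged illustration (the connect strings below are placeholders, not taken from any real project), a single generic connection name can resolve to a different physical database in each repository:

Repository      Connection Name    Connect String (example)
Development     DW                 dev_dw.world
Test            DW                 test_dw.world
Production      DW                 prod_dw.world

Because the connection name is identical in every environment, Copy Folder reuses the existing DW connection in the target repository and no manual changes to the sessions are required.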
Tip: Have the Repository Administrator or DBA set up all connections in all environments at the beginning of a project, based on the issues discussed in this document, to avoid developers creating their own connections with different conventions and possibly duplicating connections. These connections can then be protected through permission options so that only certain individuals can modify them.
PowerCenter PowerExchange Application/Relational Connections
Before the PowerCenter Server can access a source or target in a session, you must
configure connections in the Workflow Manager. When you create or modify a session
that reads from, or writes to, a database, you can select only configured source and
target databases. Connections are saved in the repository.
For PowerExchange Client for PowerCenter, you configure relational database and/or
application connections. The connection you configure depends on the type of source
data you want to extract and the extraction mode.
Source Type/Extraction Mode   Application/Relational Connection   Connection Type            Recommended Naming Convention
DB2/390 Bulk Mode             Relational                          PWX DB2390                 PWX_batch_DSNName
DB2/390 Change Mode           Application                         PWX DB2390 CDC Change      PWX_CDC_DSNName
DB2/390 Real Time Mode        Application                         PWX DB2390 CDC Real Time   PWX_RT_DSNName
DB2/400 Bulk Mode             Relational                          PWX DB2400                 PWX_batch_DSNName
DB2/400 Change Mode           Application                         PWX DB2400 CDC Change      PWX_CDC_DSNName
DB2/400 Real Time Mode        Application                         PWX DB2400 CDC Real Time   PWX_RT_DSNName
IMS Batch Mode                Application                         PWX NRDB Batch             PWX_NRDB_Recon_Name
IMS Change Mode               Application                         PWX NRDB CDC Change        PWX_CDC_Recon_Name
IMS Real Time Mode            Application                         PWX NRDB CDC Real Time     PWX_RT_Recon_Name
VSAM Batch Mode               Application                         PWX NRDB Batch             PWX_NRDB_Coll_Identifier_Name
VSAM Change Mode              Application                         PWX NRDB CDC Change        PWX_CDC_Coll_Identifier_Name
VSAM Real Time Mode           Application                         PWX NRDB CDC Real Time     PWX_RT_Coll_Identifier_Name
Oracle Real Time              Application                         PWX Oracle CDC Real Time   PWX_RT_Instance_Name

PowerCenter PowerExchange Target Connections
The connection you configure depends on the type of target data you want to load.
PowerCenter PowerExchange Target Connections
The connection you configure depends on the type of target data you want to load.
Target Type   Connection Type                             Recommended Naming Convention
DB2/390       PWX DB2390 relational database connection   PWXT_DSNName
DB2/400       PWX DB2400 relational database connection   PWXT_DSNName
Performing Incremental Loads
Challenge
Data warehousing incorporates very large volumes of data. The process of loading the
warehouse without compromising its functionality and in a reasonable timescale is
extremely difficult. The goal is to create a load strategy that can minimize downtime for
the warehouse and allow quick and robust data management.
Description
As time windows shrink and data volumes increase, it is important to understand the
impact of a suitable incremental load strategy. The design should allow data to be
incrementally added to the data warehouse with minimal impact on the overall system.
This Best Practice describes several possible load strategies.
Incremental Aggregation
Incremental aggregation is useful for applying captured changes in the source to
aggregate calculations in a session. If the source changes only incrementally, and you
can capture changes, you can configure the session to process only those changes. This
allows the PowerCenter Server to update your target incrementally, rather than forcing
it to process the entire source and recalculate the same calculations each time you run
the session.
If the session performs incremental aggregation, the PowerCenter Server saves index
and data cache information to disk when the session finishes. The next time the session
runs, the PowerCenter Server uses this historical information to perform the
incremental aggregation. Set the Incremental Aggregation Session Attribute. For
details see Chapter 22 in the Workflow Administration Guide.
Use incremental aggregation under the following conditions:
Your mapping includes an aggregate function
The source changes only incrementally
You can capture incremental changes (i.e., by filtering source data by timestamp)
You get only delta records (i.e., you may have implemented the CDC (Change
Data Capture) feature of PowerExchange if the source is on a mainframe)
Do not use incremental aggregation in the following circumstances:
You cannot capture new source data
Processing the incrementally changed source significantly changes the target: If
processing the incrementally changed source alters more than half the existing
target, the session may not benefit from using incremental aggregation.
Your mapping contains percentile or median functions
Conditions that lead to making a decision on an incremental strategy are:
Error handling and loading and unloading strategies for recovering, reloading, and
unloading data
History tracking, keeping track of what has been loaded and when
Slowly changing dimensions. Informatica Mapping Wizards are a good start to an
incremental load strategy. The Wizards generate generic mappings as a starting
point (refer to Chapter 14 in the Designer Guide)
Source Analysis
Data sources typically fall into the following possible scenarios:
Delta records - Records supplied by the source system include only new or changed records. In this scenario, all records are generally inserted or updated into the data warehouse.
Record indicator or flags - Records include columns that specify the intention of the record to be populated into the warehouse. Records can be selected based upon this flag for all inserts, updates, and deletes.
Date-stamped data - Data is organized by timestamps. Data is loaded into the warehouse based upon the last processing date or the effective date range.
Key values are present - When only key values are present, data must be checked against what has already been entered into the warehouse. All values must be checked before entering the warehouse.
No key values present - Surrogate keys are created and all data is inserted into the warehouse based upon validity of the records.

Identify Which Records Need to be Compared
After the sources are identified, you need to determine which records need to be
entered into the warehouse and how. Here are some considerations:
Compare with the target table. When source delta loads are received, determine whether the record exists in the target table. The timestamps and natural keys of the record are the starting point for identifying whether the record is new, modified, or should be archived. If the record does not exist in the target, insert the record as a new row. If it does exist, determine whether the record needs to be updated, inserted as a new record, or removed (deleted from the target or filtered out and not added to the target).
Record indicators. Record indicators can be beneficial when lookups into the
target are not necessary. Take care to ensure that the record exists for updates
or deletes, or that the record can be successfully inserted. More design effort
may be needed to manage errors in these situations.
Determine the Method of Comparison
There are three main strategies in mapping design that can be used as a method of
comparison:
Joins of sources to targets - Records are directly joined to the target using
Source Qualifier join conditions or using joiner transformations after the source
qualifiers (for heterogeneous sources). When using joiner transformations, take
care to ensure the data volumes are manageable.
Lookup on target - Using the lookup transformation, lookup the keys or critical
columns in the target relational database. Consider the caches and indexing
possibilities.
Load table log - Generate a log table of records that have already been inserted
into the target system. You can use this table for comparison with lookups or
joins, depending on the need and volume. For example, store keys in a separate
table and compare source records against this log table to determine load
strategy. Another example is to store the dates up to which data has already
been loaded into a log table.
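As a minimal sketch of the load-table-log approach (the table and column names here are assumptions for illustration only), a Source Qualifier SQL override could select only the source rows whose keys have not yet been recorded in the log table:

SELECT s.*
FROM   src_orders s
WHERE  NOT EXISTS
       (SELECT 1
        FROM   load_log l            -- log of keys already loaded into the target
        WHERE  l.order_id = s.order_id)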

Source-Based Load Strategies
Complete incremental loads in a single file/table
The simplest method for incremental loads is from flat files or a database in which all
records are going to be loaded. This strategy requires bulk loads into the warehouse
with no overhead on processing of the sources or sorting the source records.
Data can be loaded directly from the source locations into the data warehouse. There is
no additional overhead produced in moving these sources into the warehouse.
Date-stamped data
This method involves data that has been stamped using effective dates or sequences.
The incremental load can be determined by dates greater than the previous load date
or data that has an effective key greater than the last key processed.
With the use of relational sources, the records can be selected based on this effective
date and only those records past a certain date are loaded into the warehouse. Views
can also be created to perform the selection criteria. This way, the processing does not
have to be incorporated into the mappings but is kept on the source component.
Placing the load strategy into the other mapping components, on the other hand, is much more flexible and controllable by the data integration developers and by metadata.
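A minimal sketch of such a view, assuming a hypothetical ORDERS table with a LAST_UPDATE_DATE column and a control table that stores the date of the last successful load:

CREATE VIEW v_orders_delta AS
SELECT o.*
FROM   orders o,
       load_control c                          -- control table holding the last load date
WHERE  o.last_update_date > c.last_load_date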
Non-relational data can be filtered as records are loaded based upon the effective dates
or sequenced keys. A router transformation or a filter can be placed after the source
qualifier to remove old records.
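For example, a Filter transformation placed after the Source Qualifier might use a condition such as the following, where $$LAST_LOAD_DATE is a hypothetical mapping variable holding the previous load date (mapping variables are discussed next):

EFFECTIVE_DATE > $$LAST_LOAD_DATE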
To compare the effective dates, you can use mapping variables to provide the previous
date processed. The alternative is to use control tables to store the dates and update
the control table after each load.
For detailed instruction on how to select dates, refer to Using Parameters, Variables
and Parameter Files in Chapter 8 of the Designer Guide.
Changed data based on keys or record information
Data that is uniquely identified by keys can be selected based upon selection criteria.
For example, records that contain key information such as primary keys or alternate
keys can be used to determine if they have already been entered into the data
warehouse. If they exist, you can also check to see if you need to update these records
or discard the source record.
It may be possible to do a join with the target tables in which new data can be selected
and loaded into the target. It may also be feasible to lookup in the target to see if the
data exists or not.
Target-Based Load Strategies

Load directly into the target
Loading directly into the target is possible when the data is going to be bulk loaded.
The mapping will then be responsible for error control, recovery, and update strategy.
Load into flat files and bulk load using an external loader
The mapping will load data directly into flat files. You can then invoke an external
loader to bulk load the data into the target. This method reduces the load times (with
less downtime for the data warehouse) and also provides a means of maintaining a
history of data being loaded into the target. Typically, this method is only used for
updates into the warehouse.
Load into a mirror database
The data is loaded into a mirror database to avoid downtime of the active data
warehouse. After data has been loaded, the databases are switched, making the mirror
the active database and the active the mirror.
Using Mapping Variables and Parameter Files
You can use a mapping variable to perform incremental loading. The mapping variable
is used in the source qualifier or join condition to select only the new data that has
been entered based on the create_date or the modify_date, whichever date can be
used to identify a newly inserted record. However, the source system must have a
reliable date to use.
The steps involved in this method are:
Step 1: Create mapping variable
In the Mapping Designer, choose Mappings -> Parameters and Variables. Or, to create variables for a mapplet, choose Mapplet -> Parameters and Variables in the Mapplet Designer. Click Add and enter the name of the variable. In this case, make the variable a date/time. For the Aggregation option, select MAX.
In the same screen, state your initial value. This is the date at which the load should
start. The date can use any one of these formats:
MM/DD/RR
MM/DD/RR HH24:MI:SS
MM/DD/YYYY
MM/DD/YYYY HH24:MI:SS
Step 2: Use the mapping variable in the source qualifier
The SELECT statement should look like the following:
SELECT * FROM tableA
WHERE
CREATE_DATE > date($$INCREMENT_DATE, 'MM-DD-YYYY HH24:MI:SS')
Step 3: Use the mapping variable in an expression
For the purpose of this example, use an Expression transformation to work with the variable functions to set and use the mapping variable.
In the Expression transformation, create a variable port and use the SETMAXVARIABLE variable function as follows:
SETMAXVARIABLE($$INCREMENT_DATE,CREATE_DATE)
CREATE_DATE is the date for which you want to store the maximum value.
You can use the variable functions in the following transformations:
Expression
Filter
Router
Update Strategy
The variable constantly holds (per row) the max value between source and variable. So,
if one row comes through with 9/1/2004, then the variable gets that value. If all
subsequent rows are LESS than that, then 9/1/2004 is preserved.
When the mapping completes, the PERSISTENT value of the mapping variable is stored
in the repository for the next run of your session. You can view the value of the
mapping variable in the session log file.
The advantage of the mapping variable and incremental loading is that it allows the
session to use only the new rows of data. No table is needed to store the max(date)
since the variable takes care of it.
After a successful session run, the PowerCenter Server saves the final value of each
variable in the repository. So when you run your session the next time, only new data
from the source system is captured. If necessary, you can override the value saved in
the repository with a value saved in a parameter file.
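As a hedged sketch (the folder, workflow, and session names below are placeholders), a parameter file entry such as the following would override the persistent repository value of $$INCREMENT_DATE for the next run:

[Production.WF:wf_DailyLoad.ST:s_m_LoadOrders]
$$INCREMENT_DATE=06/01/2004 00:00:00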
Real-Time Integration with PowerCenter
Challenge
Configure PowerCenter to work with PowerCenter Connect to process real-time data.
This Best Practice discusses guidelines for establishing a connection with PowerCenter
and setting up a real-time session to work with PowerCenter.
Description
PowerCenter with the real-time option can be used to integrate third-party messaging
applications using a specific version of PowerCenter Connect. Each PowerCenter
Connect version supports a specific industry-standard messaging application, such as
PowerCenter Connect for MQSeries, PowerCenter Connect for JMS, and PowerCenter
Connect for TIBCO. IBM MQ Series uses a queue to store and exchange data. Other
applications, such as TIBCO and JMS, use a publish/subscribe model. In this case, the
message exchange is identified using a topic.
Connection Setup
PowerCenter uses some attribute values in order to correctly connect and identify the
third-party messaging application and message itself. Each version of PowerCenter
Connect supplies its own connection attributes that need to be configured properly
before running a real-time session.
PowerCenter Connect for MQ
1. In the Workflow Manager, connect to a repository and choose Connection ->
Queue.
2. The Queue Connection Browser appears. Select New -> Message Queue
3. The Connection Object Definition dialog box appears.
You need to specify three attributes in the Connection Object Definition dialog box:
Name - the name for the connection. (Use <queue_name>_<QM_name> to uniquely identify the connection.)
Queue Manager - the Queue Manager name for the message queue. (in Windows,
the default Queue Manager name is QM_<machine name>).
Queue Name - the Message Queue name
Obtaining the Queue Manager and Message Queue names
Open the MQ Series Administration Console. The Queue Manager should appear on
the left panel.
Expand the Queue Manager icon. A list of the queues for the queue manager
appears on the left panel.
Note that the Queue Manager name and the Queue Name are case-sensitive.
PowerCenter Connect for JMS
PowerCenter Connect for JMS can be used to read or write messages from various JMS
providers, such as IBM MQ Series JMS, BEA Weblogic Server, and IBM Websphere.
There are two types of JMS application connections:
JNDI Application Connection, which is used to connect to a JNDI server during a
session run.
JMS Application Connection, which is used to connect to a JMS provider during a
session run.
JNDI Application Connection Attributes:
Name
JNDI Context Factory
JNDI Provider URL
JNDI UserName
JNDI Password
JMS Application Connection
JMS Application Connection Attributes:
Name
JMS Destination Type
JMS Connection Factory Name
JMS Destination
JMS UserName
JMS Password
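As an illustration only (the server, directory, and object names below are invented), a JNDI and JMS application connection pair for MQSeries JMS using a file system service provider might be filled in as follows; user name and password are omitted here:

JNDI Application Connection
   Name:                         jndi_MQSeries
   JNDI Context Factory:         com.sun.jndi.fscontext.RefFSContextFactory
   JNDI Provider URL:            file:/opt/mqm/java/jndi

JMS Application Connection
   Name:                         jms_Orders
   JMS Destination Type:         Queue
   JMS Connection Factory Name:  qcf_orders
   JMS Destination:              q_orders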
Configuring the JNDI Connection for IBM MQ Series
The JNDI settings for MQ Series JMS can be configured using a file system service or
LDAP (Lightweight Directory Access Protocol).
The JNDI setting is stored in a file named JMSAdmin.config. The file should be installed
in the MQSeries Java installation/bin directory.
If you are using a file system service provider to store your JNDI settings, remove
the number sign (#) before the following context factory setting:
INITIAL_CONTEXT_FACTORY=com.sun.jndi.fscontext.RefFSContextFactory
Or, if you are using the LDAP service provider to store your JNDI settings, remove
the number sign (#) before the following context factory setting:
INITIAL_CONTEXT_FACTORY=com.sun.jndi.ldap.LdapCtxFactory
Find the PROVIDER_URL settings.
If you are using a file system service provider to store your JNDI settings, remove the
number sign (#) before the following provider URL setting and provide a value for the
JNDI directory.
PROVIDER_URL=file:/<JNDI directory>
<JNDI directory> is the directory where you want JNDI to store the .binding file.
Or, if you are using the LDAP service provider to store your JNDI settings, remove the
number sign (#) before the provider URL setting and specify a hostname.
PROVIDER_URL=ldap://<hostname>/context_name
For example, you could specify:
PROVIDER_URL=ldap://localhost/o=infa,c=rc
If you want to provide a user DN and password for connecting to JNDI, you can remove
the # from the following settings and enter a user DN and password:
PROVIDER_USERDN=cn=myname,o=infa,c=rc
PROVIDER_PASSWORD=test
The following table shows the JMSAdmin.config settings and the corresponding
attributes in the JNDI application connection in the Workflow Manager:
JMSAdmin.config Settings JNDI Application Connection Attribute
INITIAL_CONTEXT_FACTORY JNDI Context Factory
PROVIDER_URL JNDI Provider URL
PROVIDER_USERDN JNDI UserName
PROVIDER_PASSWORD JNDI Password
Configuring the JMS Connection for IBM MQ Series
The JMS connection is defined using a tool in JMS called jmsadmin that is available in
MQ Series Java installation/bin directory. Use this tool to configure the JMS Connection
Factory.
The JMS Connection Factory can be a Queue Connection Factory or Topic Connection
Factory.
When Queue Connection Factory is used, define a JMS queue as the destination.
When Topic Connection Factory is used, define a JMS topic as the destination.
The command to define a queue connection factory (qcf) is:
def qcf(<qcf_name>) qmgr(queue_manager_name)
hostname (QM_machine_hostname) port (QM_machine_port)
The command to define JMS queue is:
def q(<JMS_queue_name>) qmgr(queue_manager_name)
qu(queue_manager_queue_name)
The command to define JMS topic connection factory (tcf) is:
def tcf(<tcf_name>) qmgr(queue_manager_name)
hostname (QM_machine_hostname) port (QM_machine_port)
The command to define the JMS topic is:
def t(<JMS_topic_name>) topic(pub/sub_topic_name)
The topic name must be unique. For example: topic (application/infa)
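A hedged worked example of these commands, using invented names (queue manager QM_host01 on host host01, listener port 1414, and an underlying MQ queue ORDERS.IN), as they might be entered in the jmsadmin tool:

def qcf(qcf_orders) qmgr(QM_host01) hostname(host01) port(1414)
def q(q_orders) qmgr(QM_host01) qu(ORDERS.IN)

def tcf(tcf_prices) qmgr(QM_host01) hostname(host01) port(1414)
def t(t_prices) topic(market/prices)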
The following table shows the JMS object types and the corresponding attributes in the
JMS application connection in the Workflow Manager:
JMS Object Type                                     JMS Application Connection Attribute
QueueConnectionFactory or TopicConnectionFactory    JMS Connection Factory Name
JMS Queue Name or JMS Topic Name                    JMS Destination
Configure the JNDI and JMS Connection for IBM Websphere
Configure the JNDI settings for IBM WebSphere to use IBM WebSphere as a provider
for JMS sources or targets in a PowerCenterRT session.
JNDI Connection
Add the following option to the file JMSAdmin.bat to configure JMS properly:
-Djava.ext.dirs=<WebSphere Application Server>\bin
For example:
-Djava.ext.dirs=WebSphere\AppServer\bin
The JNDI connection resides in the JMSAdmin.config file, which is located in the MQ
Series Java/bin directory.
INITIAL_CONTEXT_FACTORY=com.ibm.websphere.naming.wsInitialContextFactory
PROVIDER_URL=iiop://<hostname>/
For example:
PROVIDER_URL=iiop://localhost/
PROVIDER_USERDN=cn=informatica,o=infa,c=rc
PROVIDER_PASSWORD=test
JMS Connection
The JMS configuration is similar to the JMS Connection for IBM MQ Series.
Configure the JNDI and JMS Connection for BEA Weblogic
Configure the JNDI settings for BEA Weblogic to use BEA Weblogic as a provider for JMS
sources or targets in a PowerCenterRT session.
PowerCenter Connect for JMS and the JMS hosting WebLogic server do not need to be
on the same server. PowerCenter Connect for JMS just needs a URL, as long as the URL
points to the right place.
JNDI Connection
The Weblogic Server automatically provides a context factory and URL during the JNDI
set-up configuration for WebLogic Server. Enter these values to configure the JNDI
connection for JMS sources and targets in the Workflow Manager.
Enter the following value for JNDI Context Factory in the JNDI Application Connection in
the Workflow Manager:
weblogic.jndi.WLInitialContextFactory
Enter the following value for JNDI Provider URL in the JNDI Application Connection in
the Workflow Manager:
t3://<WebLogic_Server_hostname>:<port>
where WebLogic Server hostname is the hostname or IP address of the WebLogic
Server and port is the port number for the WebLogic Server.
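For example, assuming a hypothetical WebLogic Server host named wls01 listening on the default port 7001:

t3://wls01:7001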
JMS Connection
The JMS connection is configured from the BEA WebLogic Server console. Select JMS ->
Connection Factory.
The JMS Destination is also configured from the BEA Weblogic Server console.
From the Console pane, select Services > JMS > Servers > <JMS Server name> >
Destinations under your domain.
Click Configure a New JMSQueue or Configure a New JMSTopic.
The following table shows the JMS object types and the corresponding attributes in the
JMS application connection in the Workflow Manager:
WebLogic Server JMS Object                JMS Application Connection Attribute
Connection Factory Settings: JNDIName     JMS Connection Factory Name
Destination Settings: JNDIName            JMS Destination
In addition to JNDI and JMS setting, BEA Weblogic also offers a function called JMS
Store, which can be used for persistent messaging when reading and writing JMS
messages. The JMS Stores configuration is available from the Console pane: select
Services > JMS > Stores under your domain.
Configuring the JNDI and JMS Connection for TIBCO
TIBCO Rendezvous Server does not adhere to JMS specifications. As a result, PowerCenter Connect for JMS can't connect directly to the Rendezvous Server. TIBCO Enterprise Server, which is JMS-compliant, acts as a bridge between PowerCenter Connect for JMS and TIBCO Rendezvous Server. Configure a connection-bridge between TIBCO Rendezvous Server and TIBCO Enterprise Server so that PowerCenter Connect for JMS can read messages from and write messages to TIBCO Rendezvous Server.
To create a connection-bridge between PowerCenter Connect for JMS and TIBCO
Rendezvous Server, follow these steps:
1. Configure PowerCenter Connect for JMS to communicate with TIBCO Enterprise
Server.
2. Configure TIBCO Enterprise Server to communicate with TIBCO Rendezvous Server.
Configure the following information in your JNDI application connection:
JNDI Context Factory: com.tibco.tibjms.naming.TibjmsInitialContextFactory
Provider URL: tibjmsnaming://<host>:<port>, where host and port are the host name and port number of the Enterprise Server.
To make a connection-bridge between TIBCO Rendezvous Server and TIBCO Enterprise Server:
1. In the file tibjmsd.conf, enable the tibrv transport configuration parameter as in
the example below, so that TIBCO Enterprise Server can communicate with
TIBCO Rendezvous messaging systems:
tibrv_transports = enabled
2. Enter the following transports in the transports.conf file:
[RV]
type = tibrv // type of external messaging system
topic_import_dm = TIBJMS_RELIABLE // only reliable/certified messages can
transfer
daemon = tcp:localhost:7500 // default daemon for the Rendezvous server
The transports in the transports.conf configuration file specify the
communication protocol between TIBCO Enterprise for JMS and the TIBCO
Rendezvous system. The import and export properties on a destination can list
one or more transports to use to communicate with the TIBCO Rendezvous
system.
3. Optionally, specify the name of one or more transports for reliable and certified message delivery in the export property in the topics.conf file, as in the following example:
topicname export="RV"
The export property allows messages published to a topic by a JMS client to be
exported to the external systems with configured transports. Currently, you can
configure transports for TIBCO Rendezvous reliable and certified messaging
protocols.
PowerCenter Connect for webMethods
When importing webMethods sources into the Designer, be sure the webMethods host name doesn't contain a '.' character. You can't use fully-qualified names for the connection when importing webMethods sources. You can use fully-qualified names for the connection when importing webMethods targets because PowerCenter doesn't use the same grouping method for importing sources and targets. To get around this, modify the host file to resolve the short name to the IP address.
For example:
Host File:
crpc23232.crp.informatica.com crpc23232
Use crpc23232 instead of crpc23232.crp.informatica.com as the host name when
importing webMethods source definition. This step is only required for importing
PowerCenter Connect for webMethods sources into the Designer.
If you are using the request/reply model in webMethods, PowerCenter needs to send an
appropriate document back to the broker for every document it receives. PowerCenter
populates some of the envelope fields of the webMethods target to enable webMethods
broker to recognize that the published document is a reply from PowerCenter. The
envelope fields destid and tag are populated for the request/reply model. The destid field should be populated from the pubid of the source document, and tag should be populated from the tag of the source document. Use the option Create Default Envelope
Fields when importing webMethods sources and targets into the Designer in order to
make the envelope fields available in PowerCenter.
Configuring the PowerCenter Connect for webMethods connection
To create or edit a PowerCenter Connect for webMethods connection, select Connections -> Application -> webMethods Broker in the Workflow Manager.
PowerCenter Connect for webMethods connection attributes:
Name
Broker Host
Broker Name
Client ID
Client Group
Application Name
Automatic Reconnect
Preserve Client State
Enter the connection to the Broker Host in the following format: <hostname:port>.
If you are using the request/reply method in webMethods, you have to specify a client
ID in the connection. Be sure that the client ID used in the request connection is the
same as the client ID used in the reply connection. Note that if you are using multiple
request/reply document pairs, you need to setup different webMethods connections for
each pair because they cannot share a client ID.
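For illustration (the host, port, and client ID below are invented), a request/reply pair of webMethods connections might share settings such as:

Request connection:   Broker Host = wmhost01:6849    Client ID = pc_reqrep_01
Reply connection:     Broker Host = wmhost01:6849    Client ID = pc_reqrep_01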
Setting Up Real-Time Session in PowerCenter
The PowerCenter real-time option uses a Zero Latency engine to process data from the
messaging system. Depending on the messaging systems and the application that
sends and receives messages, there may be a period when there are many messages
and, conversely, there may be a period when there are no messages. PowerCenter uses
the attribute Flush Latency to determine how often the messages are being flushed to
the target. PowerCenter also provides various attributes to control when the session
ends.
The following reader attributes determine when a PowerCenter session should end:
Message Count - Controls the number of messages the PowerCenter Server reads
from the source before the session stops reading from the source.
Idle Time - Indicates how long the PowerCenter Server waits when no messages
arrive before it stops reading from the source.
Time Slice Mode - Indicates a specific range of time during which the PowerCenter Server reads messages from the source. Only PowerCenter Connect for MQSeries uses this option.
Reader Time Limit - Indicates the number of seconds the PowerCenter Server
spends reading messages from the source.
The specific filter conditions and options available to you depend on which PowerCenter
Connect you use.
For example: Attributes for JMS Reader

Set the attributes that control the end of the session. One or more attributes can be used to control the end of the session.
For example, if you set the Message Count attribute to 10, the session ends after it reads 10 messages from the messaging system.
If more than one attribute is selected, the first attribute whose condition is satisfied ends the session.
Note: The real-time attributes can be found in the Reader Properties for PowerCenter
Connect for JMS, Tibco, Webmethods, and SAP Idoc. For PowerCenter Connect for MQ
Series, the real-time attributes must be specified as a filter condition.
The next step is to set the Real-time Flush Latency attribute. The Flush Latency defines
how often PowerCenter should flush messages, expressed in seconds.
For example, if the Real-time Flush Latency is set to 2, PowerCenter will flush messages
every two seconds. The messages will also be flushed from the reader buffer if the
Source Based Commit condition is reached. The Source Based Commit condition is
defined in the Properties tab of the session.
The message recovery option can be enabled to make sure no messages are lost if a
session fails as a result of unpredictable error, such as power loss. This is especially
important for real-time sessions because some messaging applications do not store the
messages after the messages are consumed by another application.
Executing a Real-Time Session
A real-time session often has to be up and running continuously to listen to the messaging application and to process messages immediately after they arrive. To achieve this, set the reader attribute Idle Time to -1 and the Flush Latency to a specific time interval; the session then continues to run and flushes messages to the target at the specified flush latency interval. This applies to all PowerCenter Connect versions except PowerCenter Connect for MQSeries (where, as noted above, the real-time attributes are specified as a filter condition).
Another scenario is the ability to read data from another source system and send it to a
real-time target immediately. For example: Reading data from a relational source and
writing it to MQ Series. In this case, set the session to run continuously so that every
change in the source system can be immediately reflected in the target.
To set a workflow to run continuously, edit the workflow and select the Scheduler tab.
Edit the Scheduler and select Run Continuously from Run Options. A continuous
workflow starts automatically when the Load Manager starts. When the workflow stops,
it restarts immediately.
Real-Time Session and Active Transformation
Some of the transformations in PowerCenter are active transformations, which means
that the number of input rows and output rows of the transformations are not the
same. In most cases, an active transformation requires all of the input rows to be processed before it passes output rows to the next transformation or target. For a real-time session, the flush latency is ignored if the DTM needs to wait for all the rows to be processed.
Depending on user needs, active transformations such as Aggregator, Rank, and Sorter can be used in a real-time session by setting the transaction scope property in the active transformation to Transaction. This signals the session to process the data in the transformation once per transaction. For example, if a real-time session uses an Aggregator that sums an input field, the summation is done per transaction, as opposed to across all rows. The result may or may not be correct depending on the requirement. Use an active transformation with a real-time session only if you want to process the data per transaction.
Custom transformations can also be defined to handle data per transaction so that they
can be used in a real-time session.
Session and Data Partitioning
Challenge
Improving performance by identifying strategies for partitioning relational tables, XML,
COBOL and standard flat files, and by coordinating the interaction between sessions,
partitions, and CPUs. These strategies take advantage of the enhanced partitioning
capabilities in PowerCenter 6.0 and higher.
Description
On hardware systems that are under-utilized, you may be able to improve performance by processing partitioned data sets in parallel, in multiple threads of the same session instance running on the PowerCenter Server engine. However, parallel execution may impair performance on over-utilized systems or systems with smaller I/O capacity.
In addition to hardware, consider these other factors when determining if a session is
an ideal candidate for partitioning: source and target database setup, target type,
mapping design, and certain assumptions that are explained in the following
paragraphs. (Use the Workflow Manager client tool to implement session partitioning
and see Chapter 13: Pipeline Partitioning in the Workflow Administration Guide for
additional information).
Assumptions
The following assumptions pertain to the source and target systems of a session that is
a candidate for partitioning. These factors can help to maximize the benefits that can
be achieved through partitioning.
Indexing has been implemented on the partition key when using a relational
source.
Source files are located on the same physical machine as the PowerCenter Server
process when partitioning flat files, COBOL, and XML, to reduce network
overhead and delay.
All possible constraints are dropped or disabled on relational targets.
All possible indexes are dropped or disabled on relational targets.
Table spaces and database partitions are properly managed on the target system.
Target files are written to same physical machine that hosts the PowerCenter
process, in order to reduce network overhead and delay.
Oracle External Loaders are utilized whenever possible
Follow these steps when considering partitioning:
First, determine whether you should partition your session. Start by checking the Idle Time and Busy Percentage for each thread; this gives high-level information about where the bottleneck is. To do this, open the session log and look for messages starting with PETL_ under the RUN INFO FOR TGT LOAD ORDER GROUP section. These PETL messages give the following details for the Reader, Transformation, and Writer threads:
Total Run Time
Total Idle Time
Busy Percentage
Parallel execution benefits systems that have the following characteristics:
Under-utilized or intermittently used CPUs. To determine if this is the case, check the CPU usage of your machine: UNIX - type vmstat 1 10 on the command line. The id column displays the percentage of time the CPU is idle during the specified interval, excluding I/O wait. If there are CPU cycles available (twenty percent or more idle time), this session's performance may be improved by adding a partition.
NT - check the task manager performance tab.
Sufficient I/O. To determine the I/O statistics:
UNIX - type iostat on the command line. The %iowait column displays the percentage of CPU time spent idling while waiting for I/O requests. The %idle column displays the total percentage of the time that the CPU spends idling (i.e., the unused capacity of the CPU).
NT - check the task manager performance tab.
Sufficient memory. If too much memory is allocated to your session, you will receive a memory allocation error. Check to see that you're using as much memory as you can. If the session is paging, increase the memory. To determine if the session is paging:
UNIX - type vmstat 1 10 on the command line. The pi column displays the number of pages swapped in from the page space during the specified interval, and the po column displays the number of pages swapped out to the page space. If these values indicate that paging is occurring, it may be necessary to allocate more memory, if possible.
NT - check the task manager performance tab.
If you determine that partitioning is practical, you can begin setting up the partition.
The following are selected hints for session setup; see the Workflow Administration
Guide for further directions on setting up partitioned sessions.
Partition Types
PowerCenter v6.x and higher provides increased control of the pipeline threads. Session
performance can be improved by adding partitions at various pipeline partition points.
When you configure the partitioning information for a pipeline, you must specify a
partition type. The partition type determines how the PowerCenter Server redistributes
data across partition points. The Workflow Manager allows you to specify the following
partition types:
Round-robin partitioning

The PowerCenter Server distributes data evenly among all partitions. Use round-robin
partitioning when you need to distribute rows evenly and do not need to group data
among partitions.
In a pipeline that reads data from file sources of different sizes, use round-robin
partitioning. For example, consider a session based on a mapping that reads data from
three flat files of different sizes.
Source file 1: 100,000 rows
Source file 2: 5,000 rows
Source file 3: 20,000 rows
In this scenario, the recommended best practice is to set a partition point after the
Source Qualifier and set the partition type to round-robin. The PowerCenter Server
distributes the data so that each partition processes approximately one third of the
data.
Hash partitioning


The PowerCenter Server applies a hash function to a partition key to group data among
partitions.
Use hash partitioning where you want to ensure that the PowerCenter Server processes groups of rows with the same partition key in the same partition; for example, in a scenario where you need to sort items by item ID but do not know how many items have a particular ID number. If you select hash auto-keys, the PowerCenter Server uses all grouped or sorted ports as the partition key. If you select hash user keys, you specify a number of ports to form the partition key.
An example of this type of partitioning is when you are using Aggregators and need to
ensure that groups of data based on a primary key are processed in the same
partition.
Key range partitioning


With this type of partitioning, you specify one or more ports to form a compound
partition key for a source or target. The PowerCenter Server then passes data to each
partition depending on the ranges you specify for each port.
Use key range partitioning where the sources or targets in the pipeline are partitioned
by key range. Refer to Workflow Administration Guide for further directions on setting
up Key range partitions.
For example, with key range partitioning set at End range = 2020, the PowerCenter
Server will pass in data where values are less than 2020. Similarly, for Start range =
2020, the PowerCenter Server will pass in data where values are equal to greater than
2020. Null values or values that might not fall in either partition will be passed through
the first partition.
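For instance, a two-partition key range configuration on a hypothetical ORDER_YEAR port, consistent with the example above, might look like:

Partition 1:   Start Range = (blank)   End Range = 2020     (values less than 2020, plus NULLs)
Partition 2:   Start Range = 2020      End Range = (blank)  (values greater than or equal to 2020)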
Pass-through partitioning


In this type of partitioning, the PowerCenter Server passes all rows at one partition
point to the next partition point without redistributing them.
Use pass-through partitioning where you want to create an additional pipeline stage to
improve performance, but do not want to (or cannot) change the distribution of data
across partitions. Refer to Workflow Administration Guide (Version 6.0) for further
directions on setting up pass-through partitions.
The Data Transformation Manager spawns a master thread on each session run, which
in itself creates three threads (reader, transformation, and writer threads) by default.
Each of these threads can, at the most, process one data set at a time and hence three
data sets simultaneously. If there are complex transformations in the mapping, the
transformation thread may take a longer time than the other threads, which can slow
data throughput.
It is advisable to define partition points at these transformations. This creates another
pipeline stage and reduces the overhead of a single transformation thread.
When you have considered all of these factors and selected a partitioning strategy, you
can begin the iterative process of adding partitions. Continue adding partitions to the
session until you meet the desired performance threshold or observe degradation in
performance.
Tips for Efficient Session and Data Partitioning
Add one partition at a time. To best monitor performance, add one partition at
a time, and note your session settings before adding additional partitions. Refer
to Workflow Administrator Guide, for more information on Restrictions on the
Number of Partitions.
Set DTM buffer memory. For a session with n partitions, set this value to at
least n times the original value for the non-partitioned session.
Set cached values for sequence generator. For a session with n partitions,
there is generally no need to use the Number of Cached Values property of the
sequence generator. If you must set this value to a value greater than zero,
make sure it is at least n times the original value for the non-partitioned session.
Partition the source data evenly. The source data should be partitioned into
equal sized chunks for each partition.
Partition tables. A notable increase in performance can also be realized when the
actual source and target tables are partitioned. Work with the DBA to discuss
the partitioning of source and target tables, and the setup of tablespaces.
Consider using external loader. As with any session, using an external loader
may increase session performance. You can only use Oracle external loaders for
partitioning. Refer to the Session and Server Guide for more information on
using and setting up the Oracle external loader for partitioning.
Write throughput. Check the session statistics to see if you have increased the
write throughput.
Paging. Check to see if the session is now causing the system to page. When you
partition a session and there are cached lookups, you must make sure that DTM
memory is increased to handle the lookup caches. When you partition a source
that uses a static lookup cache, the PowerCenter Server creates one memory
cache for each partition and one disk cache for each transformation. Thus,
memory requirements grow for each partition. If the memory is not bumped up,
the system may start paging to disk, causing degradation in performance.
When you finish partitioning, monitor the session to see if the partition is degrading or improving session performance. If the session performance improves and the session meets your requirements, add another partition.
Using Parameters, Variables and Parameter Files
Challenge
Understanding how parameters, variables, and parameter files work and using them for
maximum efficiency.
Description
Prior to the release of PowerCenter 5, the only variables inherent to the product were either specific to individual transformations or were server variables that were global in nature. Transformation variables were defined as variable ports in a transformation and could only be used in that specific transformation object (e.g., Expression, Aggregator, and Rank transformations). Similarly, global parameters defined within Server Manager would affect the subdirectories for source files, target files, log files, and so forth.
PowerCenter 5 made variables and parameters available across the entire mapping rather than for a specific transformation object, and it provided built-in parameters for use within Server Manager. Using parameter files, these values can change from session run to session run. PowerCenter 6 subsequently built upon this capability by adding several additional features, and this Best Practice is tailored to the functionality available in that release.
Parameters and Variables
Use a parameter file to define the values for parameters and variables used in a
workflow, worklet, mapping, or session. A parameter file can be created by using a text
editor such as WordPad or Notepad. List the parameters or variables and their values in
the parameter file. Parameter files can contain the following types of parameters and
variables:
Workflow variables
Worklet variables
Session parameters
Mapping parameters and variables
When using parameters or variables in a workflow, worklet, mapping, or session, the
PowerCenter Server checks the parameter file to determine the start value of the
parameter or variable. Use a parameter file to initialize workflow variables, worklet
variables, mapping parameters, and mapping variables. If not defining start values for
these parameters and variables, the PowerCenter Server checks for the start value of
the parameter or variable in other places.
Session parameters must be defined in a parameter file. Since session parameters do
not have default values, when the PowerCenter Server cannot locate the value of a
session parameter in the parameter file, it fails to initialize the session. To include
parameter or variable information for more than one workflow, worklet, or session in a
single parameter file, create separate sections for each object within the parameter file.
Also, create multiple parameter files for a single workflow, worklet, or session and
change the file that these tasks use, as necessary. To specify the parameter file that
the PowerCenter Server uses with a workflow, worklet, or session, do either of the
following:
Enter the parameter file name and directory in the workflow, worklet, or session
properties.
Start the workflow, worklet, or session using pmcmd and enter the parameter
filename and directory in the command line.
If entering a parameter file name and directory in the workflow, worklet, or session
properties and in the pmcmd command line, the PowerCenter Server uses the
information entered in the pmcmd command line.
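A hedged example of the second approach (server, user, folder, and file names are placeholders, and option names can vary by PowerCenter release, so check the Command Line Reference for your version):

pmcmd startworkflow -s pcserver01:4001 -u Administrator -p mypassword -f Production -paramfile /data/paramfiles/monthly.txt wf_MonthlyCalculations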
Parameter File Format
The format for parameter files changed in version 6 to reflect the improved functionality
and nomenclature of the Workflow Manager. When entering values in a parameter file,
precede the entries with a heading that identifies the workflow, worklet, or session
whose parameters and variables that are to be assigned. Assign individual parameters
and variables directly below this heading, entering each parameter or variable on a new
line. List parameters and variables in any order for each task.
The following heading formats can be defined:
Workflow variables:
[folder name.WF:workflow name]
Worklet variables:
[folder name.WF:workflow name.WT:worklet name]
Worklet variables in nested worklets:
[folder name.WF:workflow name.WT:worklet name.WT:worklet name...]
Session parameters, plus mapping parameters and variables:
[folder name.WF:workflow name.ST:session name] or
[folder name.session name] or
[session name]
Below each heading, define parameter and variable values as follows:
parameter name=value
parameter2 name=value
variable name=value
variable2 name=value
For example, a session in the Production folder, s_MonthlyCalculations, uses a string
mapping parameter, $$State, that needs to be set to MA, and a datetime mapping
variable, $$Time. $$Time already has an initial value of 9/30/2000 00:00:00 saved in
the repository, but this value needs to be overridden to 10/1/2000 00:00:00. The
session also uses session parameters to connect to source files and target databases,
as well as to write session log to the appropriate session log file.
The following table shows the parameters and variables that will be defined in the
parameter file:
Parameters and Variables in the Parameter File

Parameter/Variable Type                    Parameter/Variable Name   Desired Definition
String Mapping Parameter                   $$State                   MA
Datetime Mapping Variable                  $$Time                    10/1/2000 00:00:00
Source File (Session Parameter)            $InputFile1               Sales.txt
Database Connection (Session Parameter)    $DBConnection_Target      Sales (database connection)
Session Log File (Session Parameter)       $PMSessionLogFile         d:/session logs/firstrun.txt
The parameter file for the session includes the folder and session name, as well as each
parameter and variable:
[Production.s_MonthlyCalculations]
$$State=MA
$$Time=10/1/2000 00:00:00
$InputFile1=sales.txt
$DBConnection_target=sales
$PMSessionLogFile=D:/session logs/firstrun.txt
The next time the session runs, edit the parameter file to change the state to MD and
delete the $$Time variable. This allows the PowerCenter Server to use the value for the
variable that was set in the previous session run.
Mapping Variables
Declare mapping variables in PowerCenter Designer using the menu option Mappings -
> Parameters and Variables. After selecting mapping variables, use the pop-up
window to create a variable by specifying its name, data type, initial value, aggregation
type, precision, and scale. This is similar to creating a port in most transformations.
Variables, by definition, are objects that can change value dynamically. PowerCenter
has four functions to affect change to mapping variables:
SetVariable
SetMaxVariable
SetMinVariable
SetCountVariable
A mapping variable can store the last value from a session run in the repository to be
used as the starting value for the next session run.
Name
The name of the variable should be descriptive and be preceded by $$ (so that it is
easily identifiable as a variable). A typical variable name is: $$Procedure_Start_Date.
Aggregation type
This entry creates specific functionality for the variable and determines how it stores
data. For example, with an aggregation type of Max, the value stored in the repository
at the end of each session run would be the max value across ALL records until the
value is deleted.
Initial value
This value is used during the first session run when there is no corresponding and
overriding parameter file. This value is also used if the stored repository value is
deleted. If no initial value is identified, then a data-type specific default value is used.
Variable values are not stored in the repository when the session:
Fails to complete.
Is configured for a test load.
Is a debug session.
Runs in debug mode and is configured to discard session output.
Order of evaluation
The start value is the value of the variable at the start of the session. The start value
can be a value defined in the parameter file for the variable, a value saved in the
repository from the previous run of the session, a user-defined initial value for the
variable, or the default value based on the variable data type.
The PowerCenter Server looks for the start value in the following order:
1. Value in session parameter file
2. Value saved in the repository
3. Initial value
4. Default value
Mapping parameters and variables
Since parameter values do not change over the course of the session run, the value
used is based on:
Value in session parameter file
Initial value
Default value
Once defined, mapping parameters and variables can be used in the Expression Editor
section of the following transformations:
Expression
Filter
Router
Update Strategy
Mapping parameters and variables also can be used within the Source Qualifier in the
SQL query, user-defined join, and source filter sections, as well as in a SQL override in
the lookup transformation.
The lookup SQL override is similar to entering a custom query in a Source Qualifier
transformation. When entering a lookup SQL override, enter the entire override, or
generate and edit the default SQL statement. When the Designer generates the default
SQL statement for the lookup SQL override, it includes the lookup/output ports in the
lookup condition and the lookup/return port.
Note: Although you can use mapping parameters and variables when entering a lookup
SQL override, the Designer cannot expand mapping parameters and variables in the
query override and does not validate the lookup SQL override. When running a session
with a mapping parameter or variable in the lookup SQL override, the PowerCenter
Server expands mapping parameters and variables and connects to the lookup
database to validate the query override.
Also note that the Workflow Manager does not recognize variable connection parameters
(such as $DBConnection) for lookup transformations. At this time, lookups can use
$Source, $Target, or an exact database connection.
Guidelines for Creating Parameter Files

Use the following guidelines when creating parameter files (a consolidated sample file follows this list):
Capitalize folder and session names as necessary. Folder and session names
are case-sensitive in the parameter file.
Enter folder names for non-unique session names. When a session name
exists more than once in a repository, enter the folder name to indicate the
location of the session.
Create one or more parameter files. Assign parameter files to workflows,
worklets, and sessions individually. Specify the same parameter file for all of
these tasks or create several parameter files.
If including parameter and variable information for more than one session
in the file, create a new section for each session as follows. The folder
name is optional.
[folder_name.session_name]
parameter_name=value
variable_name=value
mapplet_name.parameter_name=value
[folder2_name.session_name]
parameter_name=value
variable_name=value
mapplet_name.parameter_name=value
Specify headings in any order. Place headings in any order in the parameter
file. However, if defining the same parameter or variable more than once in the
file, the PowerCenter Server assigns the parameter or variable value using the
first instance of the parameter or variable.
Specify parameters and variables in any order. Below each heading, the
parameters and variables can be specified in any order.
When defining parameter values, do not use unnecessary line breaks or
spaces. The PowerCenter Server may interpret additional spaces as part of the
value.
List all necessary mapping parameters and variables. Values entered for
mapping parameters and variables become the start value for parameters and
variables in a mapping. Mapping parameter and variable names are not case
sensitive.
List all session parameters. Session parameters do not have default values. An
undefined session parameter can cause the session to fail. Session parameter
names are not case sensitive.
Use correct date formats for datetime values. When entering datetime values,
use the following date formats:
MM/DD/RR

MM/DD/RR HH24:MI:SS
MM/DD/YYYY
MM/DD/YYYY HH24:MI:SS
Do not enclose parameters or variables in quotes. The PowerCenter Server
interprets everything after the equal sign as part of the value.
Precede parameters and variables created in mapplets with the mapplet
name as follows:
mapplet_name.parameter_name=value
mapplet2_name.variable_name=value
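Taken together, these guidelines yield a parameter file along the lines of the following
sketch. All folder, session, connection, parameter, and mapplet names below are purely
illustrative:
[FINANCE.s_m_Load_Customers]
$DBConnection_Source=DEV_ORACLE_SRC
$InputFile_Customers=/data/incoming/customers.dat
$$Load_Start_Date=01/01/2004 00:00:00
mplt_Cleanse.$$Default_Country=US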
Example: Parameter files and session parameters
Parameter files, along with session parameters, allow you to change certain values
between sessions. A commonly used feature is the ability to create user-defined
database connection session parameters to reuse sessions for different relational
sources or targets. Use session parameters in the session properties, and then define
the parameters in a parameter file. To do this, name all database connection session
parameters with the prefix $DBConnection, followed by any alphanumeric and
underscore characters as shown in the previous example where
DBConnection_target=sales. Instead of relational connections, it can also be used for
source files. Session parameters and parameter files help reduce the overhead of
creating multiple mappings when only certain attributes of a mapping need to be
changed, as shown in the examples above.
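As a sketch of this technique, a session that reads a flat file and loads a relational
target might reference session parameters such as $InputFile_Orders and
$DBConnection_Target in its properties, with the actual values supplied in the parameter
file (the names and values below are illustrative):
[Test.s_Load_Sales]
$InputFile_Orders=/data/in/orders.dat
$DBConnection_Target=sales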
Using Parameters in Source Qualifiers
Another commonly used feature is the ability to reference parameters in the Source
Qualifier, which allows the same mapping to be reused across different sessions, with
each session extracting the data specified in the parameter file it references.
It can also be useful to chain two mappings together: the first mapping builds a flat
file that serves as a parameter file, and the second mapping reads its data using a
parameter in the Source Qualifier transformation whose value comes from the parameter
file created by the first mapping.
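For example, the flat file written by the first mapping might contain nothing more than
a section header and the value to be read by the second session; the folder, session,
and parameter names here are illustrative:
[Test.s_m_Load_Orders_Delta]
$$Extract_Date=04/21/2001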
Note: Server variables cannot be modified by entries in the parameter file. For
example, there is no way to set the Workflow log directory in a parameter file. The
Workflow Log File Directory can only accept an actual directory or the
$PMWorkflowLogDir variable as a valid entry. The $PMWorkflowLogDir variable is a
server variable that is set at the server configuration level, not in the Workflow
parameter file.
Example: Variables and Parameters in an Incremental Strategy

Variables and parameters can enhance incremental strategies. The following example
uses a mapping variable, an expression transformation object, and a parameter file for
restarting.
Scenario
Company X wants to start with an initial load of all data, but wants subsequent runs to
select only new information. The source data carries an inherent post date in a column
named Date_Entered, which can be used for this purpose. The process will run once every
twenty-four hours.
Sample Solution
Create a mapping with source and target objects. From the menu create a new
mapping variable named $$Post_Date with the following attributes:
TYPE Variable
DATATYPE Date/Time
AGGREGATION TYPE MAX
INITIAL VALUE 01/01/1900

Note that there is no need to enclose the INITIAL VALUE in quotation marks. However, if
this value is used within the Source Qualifier SQL, it is necessary to convert it with
the native RDBMS function (e.g., TO_DATE). Within the Source Qualifier transformation,
use the following in the Source Filter attribute:
DATE_ENTERED > TO_DATE('$$Post_Date','MM/DD/YYYY HH24:MI:SS')
Also note that the initial value 01/01/1900 will be expanded by the PowerCenter Server
to 01/01/1900 00:00:00, hence the need to convert the parameter to a datetime.
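In other words, on the first run the PowerCenter Server expands the source filter to
approximately the following (shown only for illustration):
DATE_ENTERED > TO_DATE('01/01/1900 00:00:00','MM/DD/YYYY HH24:MI:SS')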
The next step is to link $$Post_Date and Date_Entered to an Expression transformation.
This is where the function for setting the variable will reside. An output port named
Post_Date is created with a data type of date/time. In the expression code section,
place the following function:
SETMAXVARIABLE($$Post_Date,DATE_ENTERED)

The function evaluates each value for DATE_ENTERED and updates the variable with
the Max value to be passed forward. For example:
DATE_ENTERED Resultant POST_DATE
9/1/2000 9/1/2000
10/30/2001 10/30/2001
9/2/2000 10/30/2001
Consider the following with regard to the functionality:
1. In order for the function to assign a value, and ultimately store it in the
repository, the port must be connected to a downstream object. It need not go
to the target, but it must go to another Expression Transformation. The reason
is that the memory will not be instantiated unless it is used in a downstream
transformation object.
2. In order for the function to work correctly, the rows have to be marked for
insert. If the mapping is an update-only mapping (i.e., Treat Rows As is set to
Update in the session properties) the function will not work. In this case, make
the session Data Driven and add an Update Strategy after the transformation
containing the SETMAXVARIABLE function, but before the Target.
3. If the intent is to store the original Date_Entered per row and not the evaluated
date value, then add an ORDER BY clause to the Source Qualifier. This way, the
dates are processed and set in order and data is preserved.


The first time this mapping is run, the SQL will select from the source where
Date_Entered is > 01/01/1900 providing an initial load. As data flows through the
mapping, the variable gets updated to the Max Date_Entered it encounters. Upon
successful completion of the session, the variable is updated in the repository for use in
the next session run. To view the current value for a particular variable associated with
the session, right-click on the session and choose View Persistent Values.
The following graphic shows that after the initial run, the Max Date_Entered was
02/03/1998. The next time this session is run, based on the variable in the Source
Qualifier Filter, only sources where Date_Entered > 02/03/1998 will be processed.

Resetting or overriding persistent values
To reset the persistent value to the initial value declared in the mapping, view the
persistent value from the Workflow Manager (see the graphic above) and press Delete Values.
This will delete the stored value from the repository, causing the Order of Evaluation to
use the Initial Value declared from the mapping.
If a session run is needed for a specific date, use a parameter file. There are two basic
ways to accomplish this:
Create a generic parameter file, place it on the server, and point all sessions to
that parameter file. A session may (or may not) have a variable, and the
parameter file need not have variables and parameters defined for every session
using the parameter file. To override the variable, either change, uncomment, or
delete the variable in the parameter file.
Run PMCMD for that session but declare the specific parameter file within the
PMCMD command.

Configuring the parameter file location
Specify the parameter filename and directory in the workflow or session properties. To
enter a parameter file in the workflow or session properties:
Select either the Workflow or Session, choose Edit, and click the Properties tab.
Enter the parameter directory and name in the Parameter Filename field.
Enter either a direct path or a server variable directory. Use the appropriate
delimiter for the Informatica Server operating system.
The following graphic shows the parameter filename and location specified in the
session task.

The next graphic shows the parameter filename and location specified in the Workflow.


In this example, after the initial session is run the parameter file contents may look
like:
[Test.s_Incremental]
;$$Post_Date=
By using the semicolon, the variable override is ignored and the Initial Value or Stored
Value is used. If, in the subsequent run, the data processing date needs to be set to a
specific date (for example: 04/21/2001), then a simple Perl script or manual change
can update the parameter file to:
[Test.s_Incremental]
$$Post_Date=04/21/2001
Upon running the sessions, the order of evaluation looks to the parameter file first, sees
a valid variable and value and uses that value for the session run. After successful
completion, run another script to reset the parameter file.
Example: Using session and mapping parameters in multiple database
environments
Reusable mappings that can source a common table definition across multiple
databases, regardless of differing environmental definitions (e.g., instances, schemas,
user/logins), are required in a multiple database environment.

Scenario
Company X maintains five Oracle database instances. All instances have a common
table definition for sales orders, but each instance has a unique instance name,
schema, and login.
DB Instance Schema Table User Password
ORC1 aardso orders Sam max
ORC99 environ orders Help me
HALC hitme order_done Hi Lois
UGLY snakepit orders Punch Judy
GORF gmer orders Brer Rabbit
Each sales order table has a different name, but the same definition:
ORDER_ID NUMBER (28) NOT NULL,
DATE_ENTERED DATE NOT NULL,
DATE_PROMISED DATE NOT NULL,
DATE_SHIPPED DATE NOT NULL,
EMPLOYEE_ID NUMBER (28) NOT NULL,
CUSTOMER_ID NUMBER (28) NOT NULL,
SALES_TAX_RATE NUMBER (5,4) NOT NULL,
STORE_ID NUMBER (28) NOT NULL
Sample Solution
Using Workflow Manager, create multiple relational connections. In this example, the
strings are named according to the DB Instance name. Using Designer, create the
mapping that sources the commonly defined table. Then create a Mapping Parameter
named $$Source_Schema_Table with the following attributes:


Note that the parameter attributes vary based on the specific environment. Also, the
initial value is not required as this solution will use parameter files.
Open the Source Qualifier and use the mapping parameter in the SQL Override as
shown in the following graphic.

Open the SQL editor and select Generate SQL. The generated SQL statement will show the
columns. Override the schema and table name in the SQL statement with the mapping
parameter.
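As a sketch, the resulting SQL override would look something like the following, based
on the common table definition shown above:
SELECT ORDER_ID, DATE_ENTERED, DATE_PROMISED, DATE_SHIPPED, EMPLOYEE_ID,
CUSTOMER_ID, SALES_TAX_RATE, STORE_ID
FROM $$Source_Schema_Table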
Using Workflow Manager, create a session based on this mapping. Within the Source
Database connection drop down box, choose the following parameter:
$DBConnection_Source.
Point the target to the corresponding target and finish.
Now create the parameter files. In this example, there will be five separate parameter
files.
Parmfile1.txt
[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=aardso.orders

$DBConnection_Source=ORC1
Parmfile2.txt
[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=environ.orders
$DBConnection_Source=ORC99
Parmfile3.txt
[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=hitme.order_done
$DBConnection_Source=HALC
Parmfile4.txt
[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=snakepit.orders
$DBConnection_Source=UGLY
Parmfile5.txt
[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=gmer.orders
$DBConnection_Source=GORF
Use pmcmd to run the five sessions in parallel. The pmcmd syntax for starting a workflow
is as follows:
pmcmd startworkflow -s serveraddress:portno -u Username -p Password s_Incremental
Notes on Using Parameter Files with Startworkflow
When starting a workflow, you can optionally enter the directory and name of a
parameter file. The PowerCenter Server runs the workflow using the parameters in the
file specified.
For UNIX shell users, enclose the parameter file name in single quotes:
-paramfile '$PMRootDir/myfile.txt'

For Windows command prompt users, the parameter file name cannot have beginning
or trailing spaces. If the name includes spaces, enclose the file name in double quotes:
-paramfile "$PMRootDir\my file.txt"
Note: When writing a pmcmd command that includes a parameter file located on
another machine, use the backslash (\) with the dollar sign ($). This ensures that the
machine where the variable is defined expands the server variable.
pmcmd startworkflow -uv USERNAME -pv PASSWORD -s SALES:6258 -f east -w
wSalesAvg -paramfile '\$PMRootDir/myfile.txt'
In the event that it is necessary to run the same workflow with different parameter
files, use the following five separate commands:
pmcmd startworkflow tech_user pwd 127.0.0.1:4001 Test
s_Incremental_SOURCE_CHANGES paramfile \$PMRootDir\ParmFiles\Parmfile1.txt 1 1
pmcmd startworkflow tech_user pwd 127.0.0.1:4001 Test
s_Incremental_SOURCE_CHANGES paramfile \$PMRootDir\ParmFiles\Parmfile2.txt 1 1
pmcmd startworkflow tech_user pwd 127.0.0.1:4001 Test
s_Incremental_SOURCE_CHANGES paramfile \$PMRootDir\ParmFiles\Parmfile3.txt 1 1
pmcmd startworkflow tech_user pwd 127.0.0.1:4001 Test
s_Incremental_SOURCE_CHANGES paramfile \$PMRootDir\ParmFiles\Parmfile4.txt 1 1
pmcmd startworkflow tech_user pwd 127.0.0.1:4001 Test
s_Incremental_SOURCE_CHANGES paramfile \$PMRootDir\ParmFiles\Parmfile5.txt 1 1
Alternatively, run the sessions in sequence with one parameter file. In this case, a pre-
or post-session script would change the parameter file for the next session.


Using PowerCenter Labels
Challenge
Using labels effectively in a data warehouse or data integration project to assist with
administration and migration.
Description
A label is a versioning object that can be associated with any versioned object or group
of versioned objects in a repository. Labels provide a way to tag a number of object
versions with a name for later identification. Therefore, a label is a named object in the
repository, whose purpose is to be a pointer or reference to a group of versioned
objects. For example, a label called Project X version X can be applied to all object
versions that are part of that project and release.
Labels can be used for many purposes:
Track versioned objects during development
Improve object query results.
Create logical groups of objects for future deployment.
Associate groups of objects for import and export.
Note that labels apply to individual object versions, and not objects as a whole. So if a
mapping has ten versions checked in, and a label is applied to version 9, then only
version 9 has that label. The other versions of that mapping do not automatically inherit
that label. However, multiple labels can point to the same object for greater flexibility.
The Use Repository Manager privilege is required in order to create or edit labels. To
create a label, choose Versioning >> Labels from the Repository Manager.


When creating a new label, choose a name that is as descriptive as possible. For
example, a suggested naming convention for labels is: Project_Version_Action. Include
comments for further meaningful description.
Locking the label is also advisable. This prevents anyone from accidentally associating
additional objects with the label or removing object references for the label.
Labels, like other global objects such as Queries and Deployment Groups, can have
user and group privileges attached to them. This allows an administrator to create a
label that can only be used by specific individuals or groups. Only those people working
on a specific project should be given read/write/execute permissions for labels that are
assigned to that project.

Once a label is created, it should be applied to related objects. To apply the label to
objects, invoke the Apply Label wizard from the Versioning >> Apply Label menu
option from the menu bar in the Repository Manager (as shown in the following figure).


Applying Labels
Labels can be applied to any object and cascaded upwards and downwards to parent
and/or child objects. For example, to group dependencies for a workflow, apply a label
to all children objects. The Repository Server applies labels to sources, targets,
mappings, and tasks associated with the workflow. Use the Move label property to
point the label to the latest version of the object(s).
Note: Labels can be applied to any object version in the repository except checked-out
versions. Execute permission is required for applying labels.
After the label has been applied to related objects, it can be used in queries and
deployment groups (see the Best Practice on Deployment Groups). Labels can also be
used to manage the size of the repository (i.e., to purge object versions).
Using labels in deployment
An object query can be created using the existing labels (as shown below). Labels can
be associated only with a dynamic deployment group. Based on the object query,
objects associated with that label can be used in the deployment.


Strategies for Labels
Repository Administrators and other individuals in charge of migrations should develop
their own label strategies and naming conventions in the early stages of a data
integration project. Be sure that developers are aware of the uses of these labels and
when they should apply labels.
For each planned migration between repositories, choose three labels for the
development and subsequent repositories (illustrative label names follow this list):
The first is to identify the objects that developers can mark as ready for migration.
The second should apply to migrated objects, thus developing a migration audit
trail.
The third is to apply to objects as they are migrated into the receiving repository,
completing the migration audit trail.
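For example, using the Project_Version_Action convention suggested earlier, the three
labels for a hypothetical project might be named:
CRM_v1_ReadyForMigration
CRM_v1_MigratedFromDev
CRM_v1_ReceivedInTest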
When preparing for the migration, use the first label to construct a query to build a
dynamic deployment group. The second and third labels in the process are optionally
applied by the migration wizard when copying folders between versioned repositories.
Developers and administrators do not need to apply the second and third labels
manually.

Additional labels can be created with developers to allow the progress of mappings to
be tracked if desired. For example, when an object is successfully unit-tested by the
developer, it can be marked as such. Developers can also label the object with a
migration label at a later time if necessary. Using labels in this fashion along with the
query feature allows complete or incomplete objects to be identified quickly and easily,
thereby providing an object-based view of progress.


Using PowerCenter Metadata Reporter and Metadata
Exchange Views for Quality Assurance
Challenge
The principal objectives of any QA strategy are to ensure that developed components
adhere to standards and to identify defects before incurring overhead during the
migration from development to test/production environments. Qualitative, peer-based
reviews of PowerCenter objects due for release obviously have their part to play in this
process.
Less well-appreciated is the role that the PowerCenter repository can play in an
automated QA strategy. This repository is essentially a database about the
transformation process and the software developed to implement it; the challenge is to
devise a method to exploit this resource for QA purposes.
Description
Before considering the mechanics of an automated QA strategy it is worth emphasizing
that quality should be built in from the outset. If the project involves multiple mappings
repeating the same basic transformation pattern(s), it is probably worth constructing a
virtual production line. This is essentially a template-driven approach to accelerate
development and enforce consistency through the use of the following aids:
shared template for each type of mapping
checklists to guide the developer through the process of adapting the template to
the mapping requirements
macros/scripts to generate productivity aids such as SQL overrides etc.
It is easier to ensure quality from a standardized base rather than relying on developers
to repeat accurately the same basic keystrokes.
Underpinning the exploitation of the repository for QA purposes is the adoption of
naming standards which categorize components. By running the appropriate query on
the repository, it is possible to identify those components whose attributes differ from
those predicted for the category. Thus, it is quite possible to automate some aspects of
QA. Clearly, the function of naming conventions is not just to standardize but also to
provide logical access paths into the information in the repository; names can be used
to identify patterns and/or categories and thus allow assumptions to be made about
object attributes. Together with the supported facilities provided to query the

repository, such as the Metadata Exchange (MX) Views and the PowerCenter Metadata
Reporter, this opens the door to an automated QA strategy.
For example, consider the following situation: it is possible that the EXTRACT
mapping/session should always truncate the target table before loading; conversely,
the TRANSFORM and LOAD phases should never truncate a target.
Possible code errors in this respect could be identified as follows:
Define a mapping/session naming standard to indicate EXTRACT, TRANSFORM, or
LOAD.
Develop a query on the repository to search for sessions named EXTRACT, which
do not have the truncate target option set.
Develop a query on the repository to search for sessions named TRANSFORM or
LOAD, which do have the truncate target option set.
Provide a facility to allow developers to run both queries before releasing code to
the test environment.
Alternatively, a standard may have been defined to prohibit unconnected output ports
from transformations (such as expressions) in a mapping. These can be very easily
identified from the MX View REP_MAPPING_UNCONN_PORTS.
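As a sketch, such a check can be run directly against the MX view with a simple SQL
query. The column names used here are assumptions for illustration only and should be
verified against the MX Views reference for the PowerCenter version in use:
-- Folder and column names below are illustrative; verify them against the MX Views documentation.
SELECT SUBJECT_AREA, MAPPING_NAME, OBJECT_INSTANCE_NAME, FIELD_NAME
FROM REP_MAPPING_UNCONN_PORTS
WHERE SUBJECT_AREA = 'MY_PROJECT_FOLDER'
ORDER BY MAPPING_NAME, OBJECT_INSTANCE_NAME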
The following bullets represent a high level overview of the steps involved in
automating QA:
Review the transformations/mappings/sessions/workflows and allocate to broadly
representative categories.
Identify the key attributes of each category.
Define naming standards to identify the category of each
transformation/mapping/session/workflow.
Analyze the MX Views to source the key attributes.
Develop the query to compare actual and expected attributes for each category.
After you have completed these steps, it is possible to develop a utility that compares
actual and expected attributes for developers to run before releasing code into any test
environment. Such a utility may incorporate the following processing stages:
Execute a profile to assign environment variables (repository schema user,
password etc).
Select the folder to be reviewed.
Execute the query to find exceptions.
Report the exceptions in an accessible format.
Exit with failure if exceptions are found.
TIP
Remember that any queries on the repository that bypass the MX views will
require modification when PowerCenter is subsequently upgraded; such queries are
therefore not recommended by Informatica.


Using PowerCenter with UDB
Challenge
Universal Database (UDB) is a database platform that can be used to run PowerCenter
repositories and act as source and target databases for PowerCenter mappings. Like
any software, it has its own way of doing things. It is important to understand these
behaviors so as to configure the environment correctly for implementing PowerCenter
and other Informatica products with this database platform. This Best Practice offers a
number of tips for using UDB with PowerCenter.
Description

UDB Overview
UDB is used for a variety of purposes and with various environments. UDB servers run
on Windows, OS/2, AS/400 and UNIX-based systems like AIX, Solaris, and HP-UX. UDB
supports two independent types of parallelism: symmetric multi-processing (SMP) and
massively parallel processing (MPP).
Enterprise-Extended Edition (EEE) is the most common UDB edition used in conjunction
with the Informatica product suite. UDB EEE introduces a dimension of parallelism that
can be scaled to very high performance. A UDB EEE database can be partitioned across
multiple machines that are connected by a network or a high-speed switch. Additional
machines can be added to an EEE system as application requirements grow. The
individual machines participating in an EEE installation can be either uniprocessors or
symmetric multiprocessors.
Connection Setup
You must set up a remote database connection to connect to DB2 UDB via
PowerCenter. This is necessary because DB2 UDB sets a very small limit on the number
of attachments per user to the shared memory segments when the user is using the
local (or indirect) connection/protocol. The PowerCenter server runs into this limit when
it is acting as the database agent or user. This is especially apparent when the
repository is installed on DB2 and the target data source is on the same DB2 database.
The local protocol limit will definitely be reached when using the same connection node
for the repository via the PowerCenter Server and for the targets. This occurs when the

session is executed and the server sends requests for multiple agents to be launched.
Whenever the limit on number of database agents is reached, the following error
occurs:
CMN_1022 [[IBM][CLI Driver] SQL1224N A database agent could not be started to
service a request, or was terminated as a result of a database system shutdown or a
force command. SQLSTATE=55032]
The following recommendations may resolve this problem:
Increase the number of connections permitted by DB2.
Catalog the database as if it were remote (see the example following this list). For
information on how to catalog the database with a remote node, refer to Knowledge
Base article 14745 in the my.Informatica.com support Knowledge Base.
Be sure to close connections when programming exceptions occur.
Verify that connections obtained in one method are returned to the pool via close()
(The PowerCenter Server is very likely already doing this).
Verify that your application does not try to access pre-empted connections (i.e.,
idle connections that are now used by other resources).
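For reference, cataloging the database as if it were remote from the machine running
the PowerCenter Server typically takes the following form at the DB2 command line. The
node, host, port, and database names are illustrative; see the Knowledge Base article
referenced above for the full procedure:
db2 catalog tcpip node pmnode remote db2host.mycompany.com server 50000
db2 catalog database pcrepo as pcrepo_r at node pmnode
db2 terminate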

DB2 Timestamp
DB2 has a timestamp data type that is precise to the microsecond and uses a 26-
character format, as follows:
YYYY-MM-DD-HH.MI.SS.MICROS (where MICROS after the last period represents six decimal
places of seconds)
The PowerCenter Date/Time datatype only supports precision to the second (using a 19
character format), so under normal circumstances when a timestamp source is read
into PowerCenter, the six decimal places after the second are lost. This is sufficient for
most data warehousing applications but can cause significant problems where this
timestamp is used as part of a key.
If the MICROS need to be retained, this can be accomplished by changing the format of
the column from a timestamp data type to a character 26 in the source and target
definitions. When the timestamp is read from DB2, the timestamp will be read in and
converted to character in the YYYY-MM-DD-HH.MI.SS.MICROS format. Likewise, when
writing to a timestamp, pass the date as a character in the YYYY-MM-DD-
HH.MI.SS.MICROS format. If this format is not retained, the records are likely to be
rejected due to an invalid date format error.
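When the source side is a PowerCenter Date/Time port and microsecond precision is not
actually needed, the character value can also be built in an Expression transformation.
A minimal sketch, assuming a Date/Time port named ORDER_TS and padding the microseconds
with zeros, is:
TO_CHAR(ORDER_TS, 'YYYY-MM-DD-HH24.MI.SS') || '.000000'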
It is also possible to maintain the timestamp correctly using the timestamp data type
itself. Setting a flag at the PowerCenter Server level does this; the technique is
described in Knowledge Base article 10220 at my.Informatica.com.
Importing Sources or Targets

If the value of the DB2 system variable APPLHEAPSZ is too small when you use the
Designer to import sources/targets from a DB2 database, the Designer reports an error
accessing the repository. The Designer status bar displays the following message:
SQL Error:[IBM][CLI Driver][DB2]SQL0954C: Not enough storage is available in the
application heap to process the statement.
If you receive this error, increase the value of the APPLHEAPSZ variable for your DB2
operating system. APPLHEAPSZ is the application heap size (in 4KB pages) for each
process using the database.
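As a sketch, the value can be increased from the DB2 command line; the database name
and value below are examples only, and the change takes effect once all applications
have disconnected from the database:
db2 update database configuration for pcrepo using APPLHEAPSZ 2560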
Unsupported Datatypes
PowerMart and PowerCenter do not support the following DB2 datatypes:
Dbclob
Blob
Clob
Real

DB2 External Loaders
The DB2 EE and DB2 EEE external loaders can both perform insert and replace
operations on targets. Both can also restart or terminate load operations.
The DB2 EE external loader invokes the db2load executable located in the
PowerCenter Server installation directory. The DB2 EE external loader can load
data to a DB2 server on a machine that is remote to the PowerCenter Server.
The DB2 EEE external loader invokes the IBM DB2 Autoloader program to load
data. The Autoloader program uses the db2atld executable. The DB2 EEE
external loader can partition data and load the partitioned data simultaneously
to the corresponding database partitions. When you use the DB2 EEE external
loader, the PowerCenter Server and the DB2 EEE server must be on the same
machine.
The DB2 external loaders load from a delimited flat file. Be sure that the target table
columns are wide enough to store all of the data. If you configure multiple targets in
the same pipeline to use DB2 external loaders, each loader must load to a different
tablespace on the target database. For information on selecting external loaders, see
Configuring External Loading in a Session in the PowerCenter User Guide.
Setting DB2 External Loader Operation Modes
DB2 operation modes specify the type of load the external loader runs. You can
configure the DB2 EE or DB2 EEE external loader to run in any one of the following
operation modes:
Insert. Adds loaded data to the table without changing existing table data.
Replace. Deletes all existing data from the table, and inserts the loaded data. The
table and index definitions do not change.

Restart. Restarts a previously interrupted load operation.
Terminate. Terminates a previously interrupted load operation and rolls back the
operation to the starting point, even if consistency points were passed. The
tablespaces return to normal state, and all table objects are made consistent.

Configuring Authorities, Privileges, and Permissions
When you load data to a DB2 database using either the DB2 EE or DB2 EEE external
loader, you must have the correct authority levels and privileges to load data into
the database tables.
DB2 privileges allow you to create or access database resources. Authority levels
provide a method of grouping privileges and higher-level database manager
maintenance and utility operations. Together, these functions control access to the
database manager and its database objects. You can access only those objects for
which you have the required privilege or authority.
To load data into a table, you must have one of the following authorities:
SYSADM authority
DBADM authority
LOAD authority on the database, with INSERT privilege
In addition, you must have proper read access and read/write permissions:
The database instance owner must have read access to the external loader input
files.
If you run DB2 as a service on Windows, you must configure the service start
account with a user account that has read/write permissions to use LAN
resources, including drives, directories, and files.
If you load to DB2 EEE, the database instance owner must have write access to
the load dump file and the load temporary file.
Remember, the target file must be delimited when using the DB2 AutoLoader.
Guidelines for Performance Tuning
You can achieve numerous performance improvements by properly configuring the
database manager, database, and tablespace container and parameter settings. For
example, MAXFILOP is one of the database configuration parameters that you can tune.
The default value for MAXFILOP is far too small for most databases. When this value is
too small, UDB spends a lot of extra CPU processing time closing and opening files. To
resolve this problem, increase MAXFILOP value until UDB stops closing files.
You must also have enough DB2 agents available to process the workload based on the
number of users accessing the database. Incrementally increase the value of
MAXAGENTS until agents are not stolen from another application. Moreover, sufficient
memory allocated to the CATALOGCACHE_SZ database configuration parameter also
benefits the database. If the value of catalog cache heap is greater than zero, both
DBHEAP and CATALOGCACHE_SZ should be proportionally increased.

In UDB, the LOCKTIMEOUT default value is -1 (wait indefinitely). In a data warehouse
database, set this value to 60 seconds. Remember to define TEMPSPACE tablespaces so that they have at
least 3 or 4 containers across different disks, and set the PREFETCHSIZE to a multiple
of EXTENTSIZE, where the multiplier is equal to the number of containers. Doing so will
enable parallel I/O for larger sorts, joins, and other database functions requiring
substantial TEMPSPACE space.
In UDB, the default LOGBUFSZ value of 8 is too small; try setting it to 128. Also, set
INTRA_PARALLEL to YES for CPU parallelism. The database configuration parameter
DFT_DEGREE should be set to a value between 1 and ANY, depending on the number of CPUs
available and the number of processes that will run simultaneously. Setting DFT_DEGREE
to ANY can monopolize the CPUs, since a single process can then consume all of the
processing power, while setting it to 1 provides no parallelism at all.
(Note: DFT_DEGREE and INTRA_PARALLEL are applicable only for EEE DB).
Data warehouse databases perform numerous sorts, many of which can be very large.
SORTHEAP memory is also used for hash joins, which a surprising number of DB2 users
fail to enable. To do so, use the db2set command to set environment variable
DB2_HASH_JOIN=ON.
For a data warehouse database, at a minimum, double or triple the SHEAPTHRES (to
between 40,000 and 60,000) and set the SORTHEAP size between 4,096 and 8,192. If
real memory is available, some clients use even larger values for these configuration
parameters.
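A sketch of the corresponding commands follows; the database name and values are
examples only and should be sized according to the memory available on the UDB server:
db2set DB2_HASH_JOIN=ON
db2 update dbm cfg using SHEAPTHRES 60000
db2 update db cfg for dwhdb using SORTHEAP 8192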
SQL is very complex in a data warehouse environment and often consumes large
quantities of CPU and I/O resources. Therefore, set DFT_QUERYOPT to 7 or 9.
UDB uses NUM_IO_CLEANERS for writing to TEMPSPACE, temporary intermediate
tables, index creations, and more. SET NUM_IO_CLEANERS equal to the number of
CPUs on the UDB server and focus on your disk layout strategy instead.
Lastly, for RAID devices where several disks appear as one to the operating system, be
sure to do the following:
1. db2set DB2_STRIPED_CONTAINERS=YES (do this before creating tablespaces or
before a redirected restore)
2. db2set DB2_PARALLEL_IO=* (or use TablespaceID numbers for tablespaces
residing on the RAID devices for example
DB2_PARALLEL_IO=4,5,6,7,8,10,12,13)
3. Alter the tablespace PREFETCHSIZE for each tablespace residing on RAID devices
such that the PREFETCHSIZE is a multiple of the EXTENTSIZE.
Database Locks and Performance Problems
When working in an environment with many users that target a DB2 UDB database, you
may experience slow and erratic behavior resulting from the way UDB handles database
locks. Out of the box, DB2 UDB database and client connections are configured on the
assumption that they will be part of an OLTP system and place several locks on records

and tables. Because PowerCenter typically works with OLAP systems, where it is the
only process writing to the database and users are primarily reading from it, this
default locking behavior can have a significant impact on performance.
Connections to DB2 UDB databases are set up using the DB2 Client Configuration
utility. To minimize problems with the default settings, make the following changes to
all remote clients accessing the database for read-only purposes. To help replicate
these settings, you can export the settings from one client and then import the
resulting file into all the other clients.
Enable Cursor Hold is the default setting for the Cursor Hold option. Edit the
configuration settings and make sure the Enable Cursor Hold option is not
checked.
Connection Mode should be Shared, not Exclusive.
Isolation Level should be Read Uncommitted (the minimum level) or Read
Committed (if updates by other applications are possible and dirty reads must
be avoided). A command-line approach to these settings is sketched after this list.
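Where scripting these changes is preferred over the GUI, the same keywords can
typically be set through the DB2 CLI configuration; the database alias below is
illustrative and the keyword values should be confirmed against the DB2 documentation
for the client version in use:
db2 update cli cfg for section pcrepo using CursorHold 0
db2 update cli cfg for section pcrepo using TxnIsolation 1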
To set the isolation level to dirty read at the PowerCenter Server level, you can set
a flag in the PowerCenter configuration file. For details on this process, refer to
Knowledge Base article 13575 in the my.Informatica.com support Knowledge Base.
If you're not sure how to adjust these settings, launch the IBM DB2 Client Configuration
utility, then highlight the database connection you use and select Properties. In
Properties, select Settings and then select Advanced. You will see these options and
their settings on the Transaction tab.
To export the settings from the main screen of the IBM DB2 client configuration utility,
highlight the database connection you use, then select Export and all. Use the same
process to import the settings on another client.
If users run hand-coded queries against the target table using DB2's Command Center,
be sure they know to use script mode and avoid interactive mode (by choosing the
script tab instead of the interactive tab when writing queries). Interactive mode can
lock returned records while script mode merely returns the result and does not hold
them.
If your target DB2 table is partitioned and resides across different nodes in DB2, you
can use a target partition type DB Partitioning in PowerCenter session properties.
When DB partitioning is selected, separate connections are opened directly to each
node and the load starts in parallel. This improves performance and scalability.


Using Shortcut Keys in PowerCenter Designer
Challenge
Using shortcuts and work-arounds to work as efficiently as possible in PowerCenter
Mapping Designer and Workflow Manager.
Description
After you are familiar with the normal operation of PowerCenter Mapping Designer and
Workflow Manager, you can use a variety of shortcuts to speed up their operation.
PowerCenter provides two types of shortcuts: keyboard shortcuts to edit repository
objects and maneuver through the Mapping Designer and Workflow Manager as
efficiently as possible; and shortcuts that simplify the maintenance of repository
objects.
General Suggestions
Maneuvering the Navigator window
Follow these steps to open a folder with workspace open as well:
1. Click the Open folder icon. (Note that double clicking on the folder name only
opens the folder if the folder has not yet been opened or connected to.)
2. Alternatively, right click the folder name, then scroll down and click Open.
Working with the toolbar
Using an icon on the toolbar is nearly always faster than selecting a command from a
drop-down menu.
To add more toolbars, select Tools | Customize.
Select the Toolbar tab to add or remove toolbars.
Follow these steps to use drop-down menus without the mouse:
1. Press and hold the <Alt> key. You will see an underline under one letter of each
of the menu titles.

2. Press the letter key of the letter underlined in the drop down menu you want.
For instance, press 'r' for the 'Repository' menu. The menu will appear.
3. Press the letter key of the letter underlined in the option you want. For instance,
press 'w' for 'Print Preview'.
4. Alternatively, after you have pressed the <Alt> key, use the right/left and
up/down arrows to scroll across and down the menus. Press Enter when the
desired command is highlighted.
To use the 'Create Customized Toolbars' feature to tailor a toolbar for the functions
you use frequently, press <Alt> <T> then <C>.
To delete customized icons, select Tools | Customize and select the Tools tab. You
can add an icon to an existing toolbar or create a new toolbar, depending on
where you "drag and drop" the icon. (Note: adding the 'Arrange' icon can speed
up the process of arranging mapping transformations.)
To rearrange the toolbars, click and drag the double bar that begins each toolbar.
You can insert more than one toolbar at the top of the designer tool to avoid
having the buttons go off the edge of the screen. Alternatively, you can move
toolbars to the bottom, side, or between the workspace and the message
windows (which is a handy place to put the transformations toolbar).
To use a Docking\UnDocking window (e.g., Repository Navigator), double click on
the window's title bar. If you have a problem making it dock again, right click
somewhere in the white space of the runaway window (not the title bar) and
make sure that the "Allow Docking" option is checked. When it is checked, drag
the window to its proper place and, when an outline of where the window used
to be appears, release the window.

Keyboard Shortcuts
Use the following keyboard shortcuts to perform various operations in Mapping
Designer and Workflow Manager.
To: Press:

Cancel editing in an object Esc
Check and uncheck a check box Space Bar
Copy text from an object onto the clipboard Ctrl+C
Cut text from an object onto the clipboard Ctrl+X
Edit the text of an object F2, then move the cursor to the desired location
Find all combination and list boxes Type the first letter of the list
Find tables or fields in the workspace Ctrl+F
Move around objects in a dialog box Ctrl+directional arrows
Paste copied or cut text from the clipboard into an object Ctrl+V
Select the text of an object F2
To start help F1
Mapping Designer
Navigating the Workspace
When using the "drag & drop" approach to create Foreign Key/Primary Key
relationships between tables, be sure to start in the Foreign Key table and drag the
key/field to the Primary Key table. Set the Key Type value to "NOT A KEY" prior to
dragging.
Follow these steps to quickly select multiple transformations:
1. Hold the mouse down and drag to view a box.
2. Be sure the box touches every object you want to select. The selected items will
have a distinctive outline around them.
3. If you miss one or have an extra, you can hold down the <Shift> or <Ctrl> key
and click the offending transformations one at a time. They will alternate
between being selected and deselected each time you click on them.
Follow these steps to copy and link fields between transformations:
1. You can select multiple ports when you are trying to link to the next
transformation.
2. When you are linking multiple ports, they are linked in the same order as they
are in the source transformation. You need to highlight the fields you want in the
source transformation and hold the mouse button over the port name in the
target transformation that corresponds to the source transformation port.
3. Use the Autolink function whenever possible. It is located under the Layout
menu or accessible by right-clicking somewhere in the background of the
Mapping Designer.
4. Autolink can link by name or position. PowerCenter version 6 or above gives you
the option of entering prefixes or suffixes (when you click the 'More' button).
This is especially helpful when you are trying to autolink to a Router
transformation, for instance. Each group created in a Router will have a distinct
suffix number added to the port/field name. To autolink, you need to choose the
proper Router and Router group in the 'From Transformation' space. You also

need to click the 'More' button and enter the appropriate suffix value. You must
do both to create a link.
5. Autolink does not work if any of the fields in the 'To' transformation are already
linked to another group or another stream. No error appears; the links are just
not created.
Sometimes, a shared object is very close to, but not exactly what you need. In this
case, you may want to make a copy with some minor alterations to suit your purposes.
If you try to simply click and drag the object, it will ask you if you want to make a
shortcut or it will be reusable every time. Follow these steps to make a non-reusable
copy of a reusable object:
1. Open the target folder.
2. Select the object that you want to make a copy of, either in the source or target
folder.
3. Drag the object over the workspace.
4. Press and hold the <Ctrl> key (the crosshairs symbol '+' will appear in a white
box)
5. Release the mouse button, then release the <Ctrl> key.
6. A copy confirmation window and a copy wizard window will appear. Note that
the look and feel of the copy wizard differs between versions 6 and 7.
7. The newly created transformation no longer says that it is reusable and you are
free to make changes without affecting the original reusable object.
Editing Tables/Transformation
Follow these steps to move one port in a transformation:
1. Double click the transformation and make sure you are in the "Ports" tab. (You
go directly to the Ports tab if you double click a port instead of the colored title
bar.)
2. Highlight the port and use the up/down arrow keys with the mouse (see red
circle in the figure below).
3. Or, highlight the port and then press <Alt><w> for down or <Alt> <u> for up.
(Note: You can hold down the <Alt> and hit the <w> or <u> as often as you
need although this may not be practical if you are moving far).
Alternatively, you can accomplish the same thing by following these steps:
1. Highlight the port you want to move by clicking the number beside the port
(note the blue arrow in the figure below).
2. Hold down the <Alt> key and grab the port by its number.
3. Drag the port to the desired location (the list of ports scrolls when you reach the
end). A red line indicates the new location (note the red arrow in the figure
below).
4. When the red line is pointing to the desired location, release the mouse button,
then release the <Alt> key.
Note that you cannot move more than one port at a time with this method. See below
for instructions on moving more than one port at a time.


If you are using PowerCenter version 6 or 7 and the ports you are moving are adjacent,
you can follow these steps to move more than one port at a time:
1. Highlight the ports you want to move by clicking the number beside the port
while holding down the <Ctrl> key.
2. Use the up/down arrows (see the red circle above) to move the ports to the desired location.
To add a new field or port, first highlight an existing field or port, then press
<Alt><f> to insert the new field/port below it.
To validate the Default value, first highlight the port you want to validate, and
then press <Alt><v>.
When adding a new port, just begin typing. There is no need to first press DEL to
remove the "NEWFIELD" text, or to click OK when you have finished.
This is also true when you are editing a field, as long as you have highlighted the port
so that the entire Port Name cell has a light box around it. The white box is created
when you click on the white space of the port name cell. If you click on the words in the
Port Name cell, a cursor will appear where you click. At this point, delete the parts of
the word you don't want.
When moving about in the fields of the Ports tab of the Expression Editor, use the
SPACE bar to check or uncheck the port type. Be sure to highlight the port box
to check or uncheck the port type.
Follow either of these steps to quickly open the Expression Editor of an OUT/VAR port:
1. Highlight the expression so that there is a box around the cell and press <F2>
followed by <F3>.
2. Or, highlight the expression so that there is a cursor somewhere in the
expression, then press <F2>.
To cancel an edit in the grid, press <Esc> so the changes are not saved.
For all combo/dropdown list boxes, type the first letter on the list to select the
item you want. For instance, you can highlight a port's Data type box without
displaying the drop-down. To change it to 'binary', type <b>. Then use the
arrow keys to go down to the next port. This is very handy if you want to
change all fields to string for example because using the up and down arrows
and hitting a letter is much faster than opening the drop-down menu and
making a choice each time.
To copy a selected item in the grid, press <Ctrl><c>.
To paste a selected item from the Clipboard to the grid, press <Ctrl><v>.

To delete a selected field or port from the grid, press <Alt><c>.
To copy a selected row from the grid, press <Alt><o>.
To paste a selected row from the grid, press <Alt><p>.
You can use either of the following methods to delete more than one port at a time.
You can repeatedly hit the cut button (red circle below); or
You can highlight several records and then click the cut button. Use <Shift> to
highlight many items in a row or <Ctrl> to highlight multiple non-contiguous items. Be
sure to click on the number beside the port, not the port name while you are holding
<Shift> or <Ctrl>.

Editing Expressions
Follow either of these steps to expedite validation of a newly created expression:
Click on the <Validate> button or press <Alt> and <v>. Note that this validates
and leaves the Expression Editor up.
Or, press <OK> to initiate parsing/validating the expression. The system will close
the Expression Editor if the validation is successful. If you click OK once again in
the "Expression parsed successfully" pop-up, the Expression Editor remains
open.
There is little need to type in the Expression Editor. The tabs list all functions, ports,
and variables that are currently available. If you want an item to appear in the Formula
box, just double click on it in the appropriate list on the left. This helps to avoid
typographical errors and mistakes such as including an output-only port name in an
expression.
In version 6.0 and above, if you change a port name, PowerCenter automatically updates
any expression that uses that port with the new name.
Be careful about changing data types. Any expression using the port with the new data
type may remain valid, but not perform as expected. If the change invalidates the
expression, it will be detected when the object is saved or if the Expression Editor is
active for that expression.
The following table summarizes additional shortcut keys that are applicable only when
working with Mapping Designer:

Repository Object Shortcuts
A repository object defined in a shared folder can be reused across folders by creating a
shortcut (i.e., a dynamic link to the referenced object).
Whenever possible, reuse source definitions, target definitions, reusable
transformations, mapplets, and mappings. Reusing objects allows sharing complex
mappings, mapplets or reusable transformations across folders, saves space in the
repository, and reduces maintenance.
Follow these steps to create a repository object shortcut:
1. Expand the Shared Folder.
2. Click and drag the object definition into the mapping that is open in the
workspace.
3. As the cursor enters the workspace, the object icon will appear along with a
small curve; as an example, the icon should look like this:

4. A dialog box will appear. Confirm that you want to create a shortcut.
If you want to copy an object from a shared folder instead of creating a shortcut, hold
down the <Ctrl> key before dropping the object into the workspace.
Workflow Manager
Navigating the Workspace
When editing a repository object or maneuvering around the Workflow Manager, use
the following shortcuts to speed up the operation you are performing:
To: Press
Add a new field or port Alt + F
Copy a row Alt + O
Cut a row Alt + C
Move current row down Alt + W
Move current row up Alt + U
Paste a row Alt + P
Validate the default value in a transformation Alt + V
Open the Expression Editor from the expression field F2, then press F3
To start the debugger F9

Repository Object Shortcuts
Mappings that reside in a shared folder can be reused within workflows by creating
shortcut mappings.
A set of workflow logic can be reused within workflows by creating a reusable worklet.

To: Press:
Create links Press Ctrl+F2 to select the first task you want to link, press Tab to select the rest of the tasks you want to link, then press Ctrl+F2 again to link all the selected tasks
Edit a task name in the workspace F2
Expand a selected node and all its children SHIFT + * (use the asterisk on the numeric keypad)
Move across to select tasks in the workspace Tab
Select multiple tasks Ctrl + Mouse click

Web Services
Challenge
Understanding PowerCenter Connect for Web Services and configuring PowerCenter to
access a secure web service.
Description
PowerCenter Connect for Web Services (aka WebServices Consumer) allows
PowerCenter to act as a web services client to consume external web services.
PowerCenter Connect for Web Services uses the Simple Object Access Protocol (SOAP)
to communicate with the external web service provider. An external web service can be
invoked from PowerCenter in three ways:
Web Service source
Web Service transformation
Web Service target
Web Service Source Usage
PowerCenter supports a request-response type of operation using Web Services source.
You can use the web service as a source if the input in the SOAP request remains fairly
constant since input values for a web service source can only be provided at the source
transformation level.
Note: If a SOAP fault occurs, it is treated as a fatal error, logged in the session log, and
the session is terminated.
The following steps serve as an example for invoking a temperature web service to
retrieve the current temperature for a given zip code:
1. In Source Analyzer, click Import from WSDL(Consumer).
2. Specify URL http://www.xmethods.net/sd/2001/TemperatureService.wsdl and
pick operation getTemp.
3. Open the Web Services Consumer Properties tab, click Populate SOAP request, and enter the desired zip code value.
4. Connect the output port of the web services source to the target.
Web Service Transformation Usage
PowerCenter also supports a request-response type of operation using Web Services
transformation. You can use the web service as a transformation if your input data is
available midstream and you want to capture the response values from the web
service. If a SOAP fault occurs, it is considered as a row error and logged into the
session log.
The following steps serve as an example for invoking a Stock Quote web service to
learn the price for each of the ticker symbols available in a flat file:
1. In transformation developer, create a web service consumer transformation.
2. Specify URL http://services.xmethods.net/soap/urn:xmethods-delayed-
quotes.wsdl and pick operation getQuote.
3. Connect the input port of this transformation to the field containing the ticker
symbols.
4. To invoke the web service for each input row, change the session to source-based commit with a commit interval of 1. Also change the Transaction Scope to Transaction in the web services consumer transformation.
Web Service Target Usage
PowerCenter supports a one-way type of operation using Web Services target. You can
use the web service as a target if you only need to send a message (i.e., you do not need a response). PowerCenter only waits for the web server to start processing the
message; it does not wait for the web server to finish processing the web service
operation. If a SOAP fault occurs, it is considered as a row error and logged into the
session log.
The following provides an example for invoking a sendmail web service:
1. In Warehouse Designer, click Import from WSDL(Consumer)
2. Specify URL
http://webservices.matlus.com/scripts/emailwebservice.dll/wsdl/IEmailService
and pick operation SendMail
3. In the mapping, connect the input ports of the web services target to the ports
containing appropriate values.
PowerCenter Connect for Web Services and Web Services Provider
Informatica also offers a product called Web Services Provider which differs from
PowerCenter Connect for Web Services.
In Web Services Provider, PowerCenter acts as a Service Provider and exposes
many key functionalities as web services.
In PowerCenter Connect for Web Services, PowerCenter acts as a web service
client and consumes external web services.
It is not necessary to install or configure Web Services Provider in order to use
PowerCenter Connect for Web Services.

Configuring PowerCenter to Invoke a Secure Web Service
Secure Sockets Layer (SSL) is used to provide such security features as authentication
and encryption to web services applications. The authentication certificates follow the
Public Key Infrastructure (PKI) standard, a system of digital certificates provided by
certificate authorities to verify and authenticate parties of Internet communications or
transactions. These certificates are managed in the following two keystore files:
Truststore. Truststore holds the public keys for the entities it can trust.
PowerCenter uses the entries in the Truststore file to authenticate the external
web services servers.
Keystore (Clientstore). Clientstore holds both the entity's public and private
keys. PowerCenter sends the entries in the Clientstore file to the web services
server so that the web services server can authenticate the PowerCenter
server.
By default, the keystore files jssecacerts and cacerts in the $(JAVA_HOME)/lib/security directory are used for Truststores. You can also create new keystore files and configure
the TrustStore and ClientStore parameters in the PowerCenter Server setup to point to
these files. Keystore files can contain multiple certificates and are managed using
utilities like keytool.
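For example, the certificates already present in a truststore can be listed with keytool (a sketch only; substitute the path to your keystore, and note that the default password for the cacerts file is typically changeit, which you should verify for your environment):
keytool -list -keystore <JAVA_HOME>/lib/security/cacerts -storepass changeit
The command prints each alias in the keystore along with its certificate fingerprint.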
SSL authentication can be performed in three ways:
Server authentication
Client authentication
Mutual authentication
Server authentication:
When establishing an SSL session in server authentication, the web services server
sends its certificate to PowerCenter and PowerCenter verifies whether the server
certificate can be trusted. Only the truststore file needs to be configured in this case.
Assumptions:
Web Services Server certificate is stored in server.cer file
PowerCenter Server (client) public/private key pair is available in keystore client.jks
Steps:
1. Import the server's certificate into the PowerCenter Server's truststore file. You can use one of the default keystores (jssecacerts or cacerts), or create your own
keystore file.
2. keytool -import -file server.cer -alias wserver -keystore trust.jks -trustcacerts -storepass changeit
3. At the prompt for trusting this certificate, type yes.
4. Configure PowerCenter to use this truststore file. Open the PowerCenter Server setup -> JVM Options tab and, in the value for Truststore, enter the full path and name of the keystore file (e.g., c:\trust.jks).
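To confirm that the import in step 2 succeeded, the truststore can be listed (a sketch using the same illustrative file name and password as the steps above):
keytool -list -keystore trust.jks -storepass changeit -alias wserver
If the import worked, keytool reports the wserver alias as a trusted certificate entry.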
Client authentication:
When establishing an SSL session in client authentication, PowerCenter sends its
certificate to the web services server. The web services server then verifies whether the
PowerCenter Server can be trusted. In this case, you need only the clientstore file.
Steps:
1. In this example, the keystore containing the private/public key pair is called client.jks. Be sure the client private key password and the keystore password are the same (e.g., changeit). One way to create such a keystore is sketched after these steps.
2. Configure PowerCenter to use this clientstore file. Open the PowerCenter Server setup -> JVM Options tab and, in the value for Clientstore, type the full path and name of the keystore file (e.g., c:\client.jks).
3. Add an additional JVM parameter in the PowerCenter Server setup and give the value as -Djavax.net.ssl.keyStorePassword=changeit
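If a client keystore does not exist yet, one way to create a public/private key pair is with keytool (a sketch only; the alias and distinguished name shown are illustrative, not values required by PowerCenter):
keytool -genkey -alias pcclient -keyalg RSA -keystore client.jks -storepass changeit -keypass changeit -dname "CN=PowerCenterServer, O=Example"
This produces a self-signed key pair in client.jks; for production use, the certificate would normally be signed by a certificate authority that the web services server trusts.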
Mutual authentication:
When establishing an SSL session in mutual authentication, both PowerCenter Server
and the Web Services server send their certificates to each other and both verify if the
other one can be trusted. You need to configure both the clientstore and the truststore
files.
Steps:
1. Import the server's certificate into the PowerCenter Server's truststore file.
2. keytool -import -file server.cer -alias wserver -keystore trust.jks -trustcacerts -storepass changeit
3. Configure PowerCenter to use this truststore file. Open the PowerCenter Server setup -> JVM Options tab and, in the value for Truststore, type the full path and name of the keystore file (e.g., c:\trust.jks).
4. In this example, the keystore containing the client public/private key pair is called client.jks. Be sure the client private key password and the keystore password are the same (e.g., changeit).
5. Configure PowerCenter to use this clientstore file. Open the PowerCenter Server setup -> JVM Options tab and, in the value for Clientstore, type the full path and name of the keystore file (e.g., c:\client.jks).
6. Add an additional JVM parameter in the PowerCenter Server setup and type the value as -Djavax.net.ssl.keyStorePassword=changeit
Note: If your client private key is not already present in the keystore file, you cannot use the keytool command to import it. Keytool can only generate a private key; it cannot import a private key into a keystore. In this case, use an external Java utility such as utils.ImportPrivateKey (WebLogic) or KeystoreMove (to convert PKCS#12 format to JKS) to move the key into the JKS keystore.
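If the private key and its certificate are currently in PEM files, one way to package them into PKCS#12 format (which a utility such as KeystoreMove can then convert to JKS) is with openssl; the file names below are illustrative only:
openssl pkcs12 -export -in client.pem -inkey client.key -out client.pfx -name pcclient
You are prompted for an export password, which becomes the password of the resulting PKCS#12 file.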
Converting Other Formats of Certificate Files
There are a number of formats of certificate files available: DER format (.cer and .der
extensions); PEM format (.pem extension); and PKCS#12 format (.pfx or .P12
extension). You can convert from one format of certificate to another using openssl.
Refer to the openssl documentation for complete information on such conversion. A few
examples are given below:
To convert from PEM to DER: assuming that you have a PEM file called
server.pem
openssl x509 -in server.pem -inform PEM -out server.der -outform DER
To convert a PKCS12 file, you must first convert to PEM, and then from PEM to
DER:
Assuming that your PKCS12 file is called server.pfx, the two commands are:
openssl pkcs12 -in server.pfx -out server.pem
openssl x509 -in server.pem -inform PEM -out server.der -outform DER
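The reverse conversion follows the same pattern. For example, assuming a DER file called server.der, it can be converted back to PEM with:
openssl x509 -in server.der -inform DER -out server.pem -outform PEM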


Working with PowerCenter Connect for MQSeries
Challenge
Understanding how to use IBM MQSeries applications in PowerCenter mappings.
Description
MQSeries applications communicate by sending messages asynchronously rather than
by calling each other directly. Applications can also request data using a "request
message" on a message queue. Because no open connection is required between
systems, they can run independently of one another. MQSeries enforces no structure on
the content or format of the message; this is defined by the application.
With more and more requirements for on-demand or real-time analytics, as well as
the development of Enterprise Application Integration (EAI) capabilities, MQ Series has
become an important vehicle for providing information to data warehouses in a real-
time mode.
PowerCenter provides data integration for transactional data generated by continuous online messaging systems (such as MQ Series). For these types of messaging systems, PowerCenter's Zero Latency (ZL) Engine provides immediate processing of trickle-feed data, allowing real-time data flows to be processed in both a uni-directional and a bi-directional manner.
TIP: In order to enable PowerCenter's ZL engine to process MQ messages in real-time,
the workflow must be configured to run continuously and a real-time MQ filter needs to
be applied to the MQ source qualifier (such as idle time, reader time limit, or message
count).
MQSeries Architecture
IBM MQSeries is a messaging and queuing application that permits programs to
communicate with one another across heterogeneous platforms and network protocols
using a consistent application-programming interface.
MQSeries architecture has three parts:
1. Queue Manager
2. Message Queue, which is a destination to which messages can be sent
3. MQSeries Message, which incorporates a header and a data component
Queue Manager
PowerCenter connects to Queue Manager to send and receive messages.
A Queue Manager may publish one or more MQ queues.
Every message queue belongs to a Queue Manager.
Queue Manager administers queues, creates queues, and controls queue
operation.
Message Queue
PowerCenter connects to Queue Manager to send and receive messages to one or
more message queues.
PowerCenter is responsible for deleting the message from the queue after
processing it.
TIP: There are several ways to maintain transactional consistency (i.e., clean up the
queue after reading). Refer to the Informatica Webzine article on Transactional
Consistency for details on the various ways to delete messages from the queue.
MQSeries Message
An MQSeries message is composed of two distinct sections:
MQSeries header. This section contains data about the queue message itself.
Message header data includes the message identification number, message
format, and other message descriptor data. In PowerCenter, MQSeries sources
and dynamic MQSeries targets automatically incorporate MQSeries message
header fields.
MQSeries message data block. A single data element that contains the
application data (sometimes referred to as the "message body"). The content and
format of the message data is defined by the application that puts the message
on the queue.

Extracting Data from a Queue
Reading Messages from a Queue
In order for PowerCenter to extract from the message data block, the source system
must define the data in one of the following formats:
Flat file (fixed width or delimited)
XML
COBOL
Binary
When reading a message from a queue, the PowerCenter mapping must contain an MQ
Source Qualifier (MQSQ). If the mapping also needs to read the message data block,
then an Associated Source Qualifier (ASQ) is also needed. When developing an MQ Series mapping, the MESSAGE_DATA block is re-defined by the ASQ. Based on the
format of the source data, PowerCenter will generate the appropriate transformation for
parsing the MESSAGE_DATA. Once associated, the MSG_ID field is linked within the
associated source qualifier transformation.
Applying Filters to Limit Messages Returned
Filters can be applied to the MQ Source Qualifier to reduce the number of messages
read.
Filters can also be added to control the length of time PowerCenter reads the MQ
queue.
If no filters are applied, PowerCenter reads all messages in the queue and then stops
reading.
Example:
PutDate >= 20040901 && PutDate <= 20040930
TIP: In order to leverage a single MQ queue to process multiple record types, have the source application populate an MQ header field and then filter on the value set in this field (example: ApplIdentityData = 'TRM').
Using MQ Functions
PowerCenter provides built-in functions that can also be used to filter message data.
Functions can be used to control the end-of-file of the MQSeries queue.
Functions can be used to enable PowerCenter real-time data extraction.
Available Functions:
Function Description
Idle(n) Time RT remains idle before stopping.
MsgCount(n) Number of messages read from the queue before stopping.
StartTime(time) GMT time when RT begins reading queue.
EndTime(time) GMT time when RT stops reading queue.
FlushLatency(n) Time period RT waits before committing messages read from the
queue.
ForcedEOQ(n) Time period RT reads messages from the queue before stopping.
RemoveMsg(TRUE) Removes messages from the queue.
TIP: In order to enable real-time message processing, use the FlushLatency() or
ForcedEOQ() MQ functions as part of the filter expression in the MQSQ.
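As an illustration only (the values and combination shown are arbitrary, not settings prescribed by this Best Practice), a real-time filter condition in the MQSQ might combine several of these functions:
FlushLatency(5) && Idle(30) && RemoveMsg(TRUE)
Such a condition would commit messages at the specified flush latency, stop reading once the queue has been idle for the specified period, and remove processed messages from the queue.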
Loading Messages to a Queue
PowerCenter supports two types of MQ targeting: Static and Dynamic.

Static MQ Targets. Used for loading message data (instead of header data) to
the target. A Static target does not load data to the message header fields. Use
the target definition specific to the format of the message data (i.e., flat file,
XML, or COBOL). Design the mapping as if it were not using MQ Series, then
configure the target connection to point to a MQ message queue in the session
when using MQSeries.
Dynamic. Used for binary targets only, and when loading data to a message
header. Note that certain message headers in an MQSeries message require a
predefined set of values assigned by IBM.
Dynamic MQSeries Targets
Use this type of target if message header fields need to be populated from the ETL
pipeline.
MESSAGE_DATA field data type is binary only.
Certain fields cannot be populated by the pipeline (i.e., set by the target MQ
environment):
UserIdentifier
AccountingToken
ApplIdentityData
PutApplType
PutApplName
PutDate
PutTime
ApplOriginData
Static MQSeries Targets
Unlike dynamic targets, where an MQ target transformation exists in the mapping,
static targets use existing target transformations.
Flat file
XML
COBOL
RT can only write to one MQ queue per target definition.
XML targets with multiple hierarchies can generate one or more MQ messages
(configurable).
Creating and Configuring MQSeries Sessions
After you create mappings in the Designer, you can create and configure sessions in the
Workflow Manager.
Configuring MQSeries Sources
The MQSeries source definition represents the metadata for the MQSeries source in the
repository. Unlike other source definitions, you do not create an MQSeries source
definition by importing the metadata from the MQSeries source. Since all MQSeries messages contain the same message header and message data fields, the Designer
provides an MQSeries source definition with predefined column names.
MQSeries Mappings
MQSeries mappings cannot be partitioned if an associated source qualifier is used.
For MQ Series sources, set the Source Type to the following:
Heterogeneous - when there is an associated source definition in the mapping.
This indicates that the source data is coming from an MQ source, and the
message data is in flat file, COBOL or XML format.
Message Queue - when there is no associated source definition in the mapping.
Note that there are two pages on the Source Options dialog: XML and MQSeries. You
can alternate between the two pages to set configurations for each.
Configuring MQSeries Targets
For Static MQSeries targets, select File Target type from the list. When the target is an
XML file or XML message data for a target message queue, the target type is
automatically set to XML.
If you load data to a dynamic MQ target, the target type is automatically set to
Message Queue.
On the MQSeries page, select the MQ connection to use for the source message
queue, and click OK.
Be sure to select the MQ checkbox in Target Options for the Associated file type.
Then click Edit Object Properties and type:
o the connection name of the target message queue.
o the format of the message data in the target queue (ex. MQSTR).
o the number of rows per message (only applies to flat file MQ targets).
Considerations when Working with MQSeries
The following features and functions are not available to PowerCenter when using
MQSeries:
Lookup transformations can be used in an MQSeries mapping, but lookups on
MQSeries sources are not allowed.
No Debug "Sessions". You must run an actual session to debug a queue mapping.
Certain considerations are necessary when using AEPs, Aggregators, Joiners,
Sorters, Rank, or Transaction Control transformations because they can only be
performed on one queue, as opposed to a full data set.
The MQSeries mapping cannot contain a flat file target definition if you are trying
to target an MQSeries queue.
PowerCenter version 6 and earlier performs a browse of the MQ queue.
PowerCenter version 7 provides the ability to perform a destructive read of the
MQ queue (instead of a browse).
PowerCenter version 7 also provides support for active transformations (i.e.,
Aggregators) in an MQ source mapping.

PowerCenter version 7 provides MQ message recovery on restart of failed sessions.
PowerCenter version 7 offers enhanced XML capabilities for mid-stream XML
parsing.

Appendix Information
PowerCenter uses the following datatypes in MQSeries mappings:
IBM MQSeries datatypes. IBM MQSeries datatypes appear in the MQSeries
source and target definitions in a mapping.
Native datatypes. Flat file, XML, or COBOL datatypes associated with an
MQSeries message data. Native datatypes appear in flat file, XML and COBOL
source definitions. Native datatypes also appear in flat file and XML target
definitions in the mapping.
Transformation datatypes. Transformation datatypes are generic datatypes that
PowerCenter uses during the transformation process. They appear in all the
transformations in the mapping.

IBM MQSeries Datatypes
MQSeries Datatypes Transformation Datatypes
MQBYTE BINARY
MQCHAR STRING
MQLONG INTEGER
MQHEX
Values for Message Header Fields in MQSeries Target Messages
MQSeries Message Header Description
StrucId Structure identifier
Version Structure version number
Report Options for report messages
MsgType Message type
Expiry Message lifetime
Feedback Feedback or reason code
Encoding Data encoding
CodedCharSetId Coded character set identifier
Format Format name
Priority Message priority
Persistence Message persistence
MsgId Message identifier
CorrelId Correlation identifier
BackoutCount Backout counter
ReplyToQ Name of reply queue
ReplyToQMgr Name of reply queue manager
UserIdentifier Defined by the environment. If the MQSeries server cannot determine this value, the value for the field is null.
AccountingToken Defined by the environment. If the MQSeries server
cannot determine this value, the value for the field is
MQACT_NONE.
ApplIdentityData Application data relating to identity. The value for
ApplIdentityData is null.
PutApplType Type of application that put the message on queue.
Defined by the environment.
PutApplName Name of application that put the message on queue.
Defined by the environment. If the MQSeries server
cannot determine this value, the value for the field is
null.
PutDate Date when the message arrives in the queue.
PutTime Time when the message arrives in queue.
ApplOriginData Application data relating to origin. The value for ApplOriginData is null.
GroupId Group identifier
MsgSeqNumber Sequence number of logical messages within group.
Offset Offset of data in physical message from start of
logical message.
MsgFlags Message flags
OriginalLength Length of original message


A Mapping Approach to Trapping Data Errors
Challenge
Addressing data content errors within mappings by re-routing erroneous rows to a target other than the original target table.
Description
Identifying errors and creating an error handling strategy is an essential part of a data
warehousing project. In the production environment, data must be checked and
validated prior to entry into the data warehouse. One strategy for handling errors is to
maintain database constraints. Another approach is to use mappings to trap data
errors.
The first step in using mappings to trap errors is to understand and identify the error
handling requirements.
Consider the following questions:
What types of errors are likely to be encountered?
Of these errors, which ones should be captured?
What process can capture the possible errors?
Should errors be captured before they have a chance to be written to the target
database?
Should bad files be used?
Will any of these errors need to be reloaded or corrected?
How will the users know if errors are encountered?
How will the errors be stored?
Should descriptions be assigned for individual errors?
Can a table be designed to store captured errors and the error descriptions?
Capturing data errors within a mapping and re-routing these errors to an error table
allows for easy analysis by the end users and improves performance. One practical
application of the mapping approach is to capture foreign key constraint errors. This
can be accomplished by creating a lookup into a dimension table prior to loading the
fact table. Referential integrity is assured by including this functionality in a mapping.
The database still enforces the foreign key constraints, but erroneous data will not be
written to the target table. Also, if constraint errors are captured within the mapping, the PowerCenter server will not have to write the error to the session log and the
reject/bad file.
Data content errors can also be captured in a mapping. Mapping logic can identify data
content errors and attach descriptions to the errors. This approach can be effective for
many types of data content errors, including: date conversion, null values intended for
not null target fields, and incorrect data formats or data types.
Error Handling Example
In the following example, we want to capture null values before they enter into target
fields that do not allow nulls.
Once we've identified the null values, the next step is to separate these errors from the data flow. Use the Router Transformation to create a stream of data that will be the
error route. Any row containing an error (or errors) will be separated from the valid
data and uniquely identified with a composite key consisting of a MAPPING_ID and a
ROW_ID. The MAPPING_ID refers to the mapping name and the ROW_ID is generated
by a Sequence Generator. The composite key allows developers to trace rows written to
the error tables.
Error tables are important to an error handling strategy because they store the
information useful to error identification and troubleshooting. In this example, the two
error tables are ERR_DESC_TBL and TARGET_NAME_ERR.
The ERR_DESC_TBL table will hold information about the error, such as the mapping
name, the ROW_ID, and a description of the error. This table is designed to hold all
error descriptions for all mappings within the repository for reporting purposes.
The TARGET_NAME_ERR table will be an exact replica of the target table with two
additional columns: ROW_ID and MAPPING_ID. These two columns allow the
TARGET_NAME_ERR and the ERR_DESC_TBL to be linked. The TARGET_NAME_ERR
table provides the user with the entire row that was rejected, enabling the user to trace the error rows back to the source.

The error handling functionality assigns a unique description for each error in the
rejected row. In this example, any null value intended for a not null target field will

PAGE BP-194 BEST PRACTICES INFORMATICA CONFIDENTIAL
generate an error message such as Column1 is NULL or Column2 is NULL. This step
can be done in an Expression Transformation.
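As a sketch of what such an expression might look like (the port names are illustrative; IIF and ISNULL are standard PowerCenter expression functions), the Expression Transformation could derive one description per checked column:
ERROR_DESC_1 = IIF(ISNULL(Column1), 'Column1 is NULL', NULL)
ERROR_DESC_2 = IIF(ISNULL(Column2), 'Column2 is NULL', NULL)
ERROR_DESC_3 = IIF(ISNULL(Column3), 'Column3 is NULL', NULL)
Only the non-null descriptions are then carried forward into ERR_DESC_TBL in the later steps.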
After the field descriptions are assigned, we need to break the error row into several
rows, with each containing the same content except for a different error description.
You can use the Normalizer Transformation to break one row of data into many rows.
After a single row of data is separated based on the number of possible errors on it, we
need to filter the columns within the row that are actually errors. One record of data
may have zero to multiple errors. In this example, the record has three errors, so we need to generate three error rows, each with a different error description (ERROR_DESC), for the ERR_DESC_TBL table.
When the error records are written to ERR_DESC_TBL, we can link those records to the
one record in table TARGET_NAME_ERR using the ROW_ID and MAPPING_ID. The
following chart shows how the two error tables can be linked; note how the ROW_ID and MAPPING_ID values tie the rows in the two tables together.
TARGET_NAME_ERR
Column1 Column2 Column3 ROW_ID MAPPING_ID
NULL NULL NULL 1 DIM_LOAD
ERR_DESC_TBL
FOLDER_NAME MAPPING_ID ROW_ID ERROR_DESC LOAD_DATE SOURCE TARGET
CUST DIM_LOAD 1 Column 1 is NULL SYSDATE DIM FACT
CUST DIM_LOAD 1 Column 2 is NULL SYSDATE DIM FACT
CUST DIM_LOAD 1 Column 3 is NULL SYSDATE DIM FACT
In the mapping, the error route from the Router Transformation flows through the Expression Transformation (which assigns the error descriptions) and the Normalizer Transformation (which splits the row), and is then filtered before loading the TARGET_NAME_ERR and ERR_DESC_TBL error tables.
The mapping approach is effective because it takes advantage of reusable objects; the error handling logic can be encapsulated in a mapplet and reused, making error detection easy to implement and manage across a variety of mappings.
By adding another layer of complexity within the mappings, errors can be flagged as
soft or hard.
A hard error can be defined as one that would fail when being written to the
database, such as a constraint error.
A soft error can be defined as a data content error.
A record flagged as a hard error is written to the error route, while a record flagged as a soft error can be written to both the target system and the error tables. This gives business analysts an opportunity to evaluate and correct data imperfections while still allowing the records to be processed for end-user reporting.
Ultimately, business organizations need to decide if the analysts should fix the data in
the reject table or in the source systems. The advantage of the mapping approach is
that all errors are identified as either data errors or constraint errors and can be
properly addressed. The mapping approach also reports errors based on projects or
categories by identifying the mappings that contain errors. The most important aspect of the mapping approach, however, is its flexibility. Once an error type is identified, the
error handling logic can be placed anywhere within a mapping. By using the mapping
approach to capture identified errors, data warehouse operators can effectively
communicate data quality issues to the business users.


Error Handling Strategies
Challenge
The challenge is to accurately and efficiently load data into the target data architecture.
This Best Practice describes various loading scenarios, the use of data profiles, an
alternate method for identifying data errors, methods for handling data errors, and
alternatives for addressing the most common types of problems. For the most part,
these strategies are relevant whether your data integration project is loading an
operational data structure (as with data migrations, consolidations, or loading various
sorts of operational data stores) or loading a data warehousing structure.
Description
Regardless of target data structure, your loading process must validate that the data
conforms to known rules of the business. When the source system data does not meet
these rules, the process needs to handle the exceptions in an appropriate manner. The
business needs to be aware of the consequences of either permitting invalid data to
enter the target or rejecting it until it is fixed. Both approaches present complex
issues. The business must decide what is acceptable and prioritize two conflicting
goals:
The need for accurate information
The ability to analyze or process the most complete information available with the
understanding that errors can exist.

Data Integration Process Validation
In general, there are three methods for handling data errors detected in the loading
process:
Reject All. This is the simplest to implement since all errors are rejected from
entering the target when they are detected. This provides a very reliable target
that the users can count on as being correct, although it may not be complete.
Both dimensional and factual data can be rejected when any errors are
encountered. Reports indicate what the errors are and how they affect the
completeness of the data.

Dimensional or Master Data errors can cause valid factual data to be rejected because a foreign key relationship cannot be created. These errors need to be
fixed in the source systems and reloaded on a subsequent load. Once the
corrected rows have been loaded, the factual data will be reprocessed and
loaded, assuming that all errors have been fixed. This delay may cause some
user dissatisfaction since the users need to take into account that the data they
are looking at may not be a complete picture of the operational systems until
the errors are fixed. For an operational system, this delay may affect
downstream transactions.

The development effort required to fix a Reject All scenario is minimal, since the
rejected data can be processed through existing mappings once it has been
fixed. Minimal additional code may need to be written since the data will only
enter the target if it is correct, and it would then be loaded into the data mart
using the normal process.
Reject None. This approach gives users a complete picture of the available data
without having to consider data that was not available due to it being rejected
during the load process. The problem is that the data may not be complete or
accurate. All of the target data structures may contain incorrect information
that can lead to incorrect decisions or faulty transactions.

With Reject None, the complete set of data is loaded but the data may not
support correct transactions or aggregations. Factual data can be allocated to
dummy or incorrect dimension rows, resulting in grand total numbers that are
correct, but incorrect detail numbers. After the data is fixed, reports may
change, with detail information being redistributed along different hierarchies.

The development effort to fix this scenario is significant. After the errors are
corrected, a new loading process needs to correct all of the target data
structures, which can be a time-consuming effort based on the delay between an
error being detected and fixed. The development strategy may include removing
information from the target, restoring backup tapes for each night's load, and
reprocessing the data. Once the target is fixed, these changes need to be
propagated to all downstream data structures or data marts.
Reject Critical. This method provides a balance between missing information and
incorrect information. This approach involves examining each row of data, and
determining the particular data elements to be rejected. All changes that are
valid are processed into the target to allow for the most complete picture.
Rejected elements are reported as errors so that they can be fixed in the source
systems and loaded on a subsequent run of the ETL process.

This approach requires categorizing the data in two ways: 1) as Key Elements or
Attributes, and 2) as Inserts or Updates.

Key elements are required fields that maintain the data integrity of the target
and allow for hierarchies to be summarized at different levels in the
organization. Attributes provide additional descriptive information per key
element.

Inserts are important for dimensions or master data because subsequent factual
data may rely on the existence of the dimension data row in order to load
properly. Updates do not affect the data integrity as much because the factual data can usually be loaded with the existing dimensional data unless the update
is to a Key Element.

The development effort for this method is more extensive than Reject All since it
involves classifying fields as critical or non-critical, and developing logic to
update the target and flag the fields that are in error. The effort also
incorporates some tasks from the Reject None approach in that processes must
be developed to fix incorrect data in the entire target data architecture.

Informatica generally recommends using the Reject Critical strategy to maintain
the accuracy of the target. By providing the most fine-grained analysis of errors,
this method allows the greatest amount of valid data to enter the target on each
run of the ETL process, while at the same time screening out the unverifiable
data fields. However, business management needs to understand that some
information may be held out of the target, and also that some of the information
in the target data structures may be at least temporarily allocated to the wrong
hierarchies.

Using Profiles
Profiles are tables used to track historical changes to the source data. As the
source systems change, Profile records are created with date stamps that indicate when
the change took place. This allows power users to review the target data using either
current (As-Is) or past (As-Was) views of the data.
Profiles should occur once per change in the source systems. Problems occur when two
fields change in the source system and one of those fields produces an error. When the
second field is fixed, it is difficult for the ETL process to produce a reflection of data
changes since there is now a question whether to update a previous Profile or create a
new one. The first value passes validation, which produces a new Profile record, while
the second value is rejected and is not included in the new Profile. When this error is
fixed, it would be desirable to update the existing Profile rather than creating a new
one, but the logic needed to perform this UPDATE instead of an INSERT is complicated.
If a third field is changed before the second field is fixed, the correction process cannot
be automated. The following hypothetical example represents three field values in a
source system. The first row on 1/1/2000 shows the original values. On 1/5/2000, Field
1 changes from Closed to Open, and Field 2 changes from Black to BRed, which is
invalid. On 1/10/2000 Field 3 changes from Open 9-5 to Open 24hrs, but Field 2 is still
invalid. On 1/15/2000, Field 2 is finally fixed to Red.
Date Field 1 Value Field 2 Value Field 3 Value
1/1/2000 Closed Sunday Black Open 9-5
1/5/2000 Open Sunday BRed Open 9-5
1/10/2000 Open Sunday BRed Open 24hrs
1/15/2000 Open Sunday Red Open 24hrs
Three methods exist for handling the creation and update of Profiles:

1. The first method produces a new Profile record each time a change is detected in the
source. If a field value was invalid, then the original field value is maintained.
Date Profile Date Field 1 Value Field 2 Value Field 3 Value
1/1/2000 1/1/2000 Closed Sunday Black Open 9-5
1/5/2000 1/5/2000 Open Sunday Black Open 9-5
1/10/2000 1/10/2000 Open Sunday Black Open 24hrs
1/15/2000 1/15/2000 Open Sunday Red Open 24hrs
By applying all corrections as new Profiles, this method simplifies the process: every change in the source system is applied directly to the target. Each change is applied as a new change that creates a new Profile, regardless of whether it is a fix to a previous error. This incorrectly shows in the target that two changes occurred to the source information when, in reality, a mistake was entered on the first change and should have been reflected in the first Profile; the second Profile should not have been created.
2. The second method updates the first Profile created on 1/5/2000 until all fields are
corrected on 1/15/2000, which loses the Profile record for the change to Field 3.
Date Profile Date Field 1 Value Field 2 Value Field 3 Value
1/1/2000 1/1/2000 Closed Sunday Black Open 9-5
1/5/2000 1/5/2000 Open Sunday Black Open 9-5
1/10/2000 1/5/2000 (Update) Open Sunday Black Open 24hrs
1/15/2000 1/5/2000 (Update) Open Sunday Red Open 24hrs
If we try to apply changes to the existing Profile, as in this method, we run the risk of
losing Profile information. If the third field changes before the second field is fixed, we
show the third field changed at the same time as the first. When the second field was
fixed it would also be added to the existing Profile, which incorrectly reflects the
changes in the source system.
3. The third method creates only two new Profiles, but then causes an update to the
Profile records on 1/15/2000 to fix the Field 2 value in both.
Date Profile Date Field 1 Value Field 2 Value Field 3 Value
1/1/2000 1/1/2000 Closed Sunday Black Open 9-5
1/5/2000 1/5/2000 Open Sunday Black Open 9-5
1/10/2000 1/10/2000 Open Sunday Black Open 24hrs
1/15/2000 1/5/2000 (Update) Open Sunday Red Open 9-5
1/15/2000 1/10/2000 (Update) Open Sunday Red Open 24hrs
If we try to implement a method that updates old Profiles when errors are fixed, as in
this option, we need to create complex algorithms that handle the process correctly. It
involves being able to determine when an error occurred and examining all Profiles
generated since then and updating them appropriately. And, even if we create the
algorithms to handle these methods, we still have an issue of determining if a value is a
correction or a new value. If an error is never fixed in the source system, but a new value is entered, we would identify it as a correction to the previous error, causing an automated process to update old Profile records when, in reality, a new Profile record should have been created.
Recommended Method
A method exists to track old errors so that we know when a value was rejected. Then,
when the process encounters a new, correct value it flags it as part of the load strategy
as a potential fix that should be applied to old Profile records. In this way, the corrected
data enters the target as a new Profile record, but the process of fixing old Profile
records, and potentially deleting the newly inserted record, is delayed until the data is
examined and an action is decided. Once an action is decided, another process
examines the existing Profile records and corrects them as necessary. This method only
delays the As-Was analysis of the data until the correction method is determined
because the current information is reflected in the new Profile.
Data Quality Edits
Quality indicators can be used to record definitive statements regarding the quality of
the data received and stored in the target. The indicators can be appended to existing
data tables or stored in a separate table linked by the primary key. Quality indicators
can be used to:
show the record and field level quality associated with a given record at the time
of extract
identify data sources and errors encountered in specific records
support the resolution of specific record error types via an update and
resubmission process.
Quality indicators may be used to record several types of errors, e.g., fatal errors
(missing primary key value), missing data in a required field, wrong data type/format,
or invalid data value. If a record contains even one error, data quality (DQ) fields will
be appended to the end of the record, one field for every field in the record. A data
quality indicator code is included in the DQ fields corresponding to the original fields in
the record where the errors were encountered. Records containing a fatal error are
stored in a Rejected Record Table and associated to the original file name and record
number. These records cannot be loaded to the target because they lack a primary key
field to be used as a unique record identifier in the target.
The following types of errors cannot be processed:
A source record does not contain a valid key. This record would be sent to a reject
queue. Metadata will be saved and used to generate a notice to the sending
system indicating that x number of invalid records were received and could not
be processed. However, in the absence of a primary key, no tracking is possible
to determine whether the invalid record has been replaced or not.
The source file or record is illegible. The file or record would be sent to a reject
queue. Metadata indicating that x number of invalid records were received and
could not be processed may or may not be available for a general notice to be
sent to the sending system. In this case, due to the nature of the error, no
tracking is possible to determine whether the invalid record has been replaced or
not. If the file or record is illegible, it is likely that individual unique records
within the file are not identifiable. While information can be provided to the
source system site indicating there are file errors for x number of records,
specific problems may not be identifiable on a record-by-record basis.
In these error types, the records can be processed, but they contain errors:
A required (non-key) field is missing.
The value in a numeric or date field is non-numeric.
The value in a field does not fall within the range of acceptable values identified for
the field. Typically, a reference table is used for this validation.
When an error is detected during ingest and cleansing, the identified error type is
recorded.
Quality Indicators (Quality Code Table)
The requirement to validate virtually every data element received from the source data
systems mandates the development, implementation, capture and maintenance of
quality indicators. These are used to indicate the quality of incoming data at an
elemental level. Aggregated and analyzed over time, these indicators provide the
information necessary to identify acute data quality problems, systemic issues, business
process problems and information technology breakdowns.
The quality indicators (0-No Error, 1-Fatal Error, 2-Missing Data from a Required Field, 3-Wrong Data Type/Format, 4-Invalid Data Value, and 5-Outdated Reference Table in Use) provide a concise indication of the quality of the data within specific fields
for every data type. These indicators provide the opportunity for operations staff, data
quality analysts and users to readily identify issues potentially impacting the quality of
the data. At the same time, these indicators provide the level of detail necessary for
acute quality problems to be remedied in a timely manner.
Handling Data Errors
The need to periodically correct data in the target is inevitable. But how often should
these corrections be performed?
The correction process can be as simple as updating field information to reflect actual
values, or as complex as deleting data from the target, restoring previous loads from
tape, and then reloading the information correctly. Although we try to avoid performing
a complete database restore and reload from a previous point in time, we cannot rule
this out as a possible solution.
Reject Tables vs. Source System
As errors are encountered, they are written to a reject file so that business analysts can
examine reports of the data and the related error messages indicating the causes of
error. The business needs to decide whether analysts should be allowed to fix data in
the reject tables, or whether data fixes will be restricted to source systems. If errors
are fixed in the reject tables, the target will not be synchronized with the source
systems. This can present credibility problems when trying to track the history of
changes in the target data architecture. If all fixes occur in the source systems, then
these fixes must be applied correctly to the target data.
Attribute Errors and Default Values
Attributes provide additional descriptive information about a dimension concept.
Attributes include things like the color of a product or the address of a store. Attribute
errors are typically things like an invalid color or inappropriate characters in the
address. These types of errors do not generally affect the aggregated facts and
statistics in the target data; the attributes are most useful as qualifiers and filtering
criteria for drilling into the data, (e.g. to find specific patterns for market research).
Attribute errors can be fixed by waiting for the source system to be corrected and
reapplied to the data in the target.
When attribute errors are encountered for a new dimensional value, default values can
be assigned to let the new record enter the target. Some rules that have been proposed
for handling defaults are as follows:
Value Types Description Default
Reference Values Attributes that are foreign keys to other tables Unknown
Small Value Sets Y/N indicator fields No
Other Any other type of attribute Null or business-provided value
Reference tables are used to normalize the target model to prevent the duplication of
data. When a source value does not translate into a reference table value, we use the
Unknown value. (All reference tables contain a value of Unknown for this purpose.)
The business should provide default values for each identified attribute. Fields that are
restricted to a limited domain of values (e.g. On/Off or Yes/No indicators), are referred
to as small value sets. When errors are encountered in translating these values, we use
the value that represents off or No as the default. Other values, like numbers, are
handled on a case-by-case basis. In many cases, the data integration process is set to
populate Null into these fields, which means undefined in the target. After a source
system value is corrected and passes validation, it is corrected in the target.
Primary Key Errors
The business also needs to decide how to handle new dimensional values such as
locations. Problems occur when the new key is actually an update to an old key in the
source system. For example, a location number is assigned and the new location is
transferred to the target using the normal process; then the location number is
changed due to some source business rule such as: all Warehouses should be in the
5000 range. The process assumes that the change in the primary key is actually a new
warehouse and that the old warehouse was deleted. This type of error causes a
separation of fact data, with some data being attributed to the old primary key and
some to the new. An analyst would be unable to get a complete picture.

Fixing this type of error involves integrating the two records in the target data, along
with the related facts. Integrating the two rows involves combining the Profile
information, taking care to coordinate the effective dates of the Profiles to sequence
properly. If two Profile records exist for the same day, then a manual decision is
required as to which is correct. If facts were loaded using both primary keys, then the
related fact rows must be added together and the originals deleted in order to correct
the data.
The situation is more complicated when the opposite condition occurs (i.e., two primary
keys mapped to the same target data ID really represent two different IDs). In this
case, it is necessary to restore the source information for both dimensions and facts
from the point in time at which the error was introduced, deleting affected records from
the target and reloading from the restore to correct the errors.
DM Facts Calculated from EDW Dimensions
If information is captured as dimensional data from the source, but used as measures
residing on the fact records in the target, we must decide how to handle the facts. From
a data accuracy view, we would like to reject the fact until the value is corrected. If we
load the facts with the incorrect data, the process to fix the target can be time
consuming and difficult to implement.
If we let the facts enter downstream target structures, we need to create processes
that update them after the dimensional data is fixed. If we reject the facts when these
types of errors are encountered, the fix process becomes simpler. After the errors are
fixed, the affected rows can simply be loaded and applied to the target data.
Fact Errors
If there are no business rules that reject fact records except for relationship errors to
dimensional data, then when we encounter errors that would cause a fact to be
rejected, we save these rows to a reject table for reprocessing the following night. This
nightly reprocessing continues until the data successfully enters the target data
structures. Initial and periodic analyses should be performed on the errors to determine
why they are not being loaded.
Data Stewards
Data Stewards are generally responsible for maintaining reference tables and
translation tables, creating new entities in dimensional data, and designating one
primary data source when multiple sources exist. Reference data and translation tables
enable the target data architecture to maintain consistent descriptions across multiple
source systems, regardless of how the source system stores the data. New entities in
dimensional data include new locations, products, hierarchies, etc. Multiple source data
occurs when two source systems can contain different data for the same dimensional
entity.
Reference Tables
The target data architecture may use reference tables to maintain consistent
descriptions. Each table contains a short code value as a primary key and a long
description for reporting purposes. A translation table is associated with each reference
table to map the codes to the source system values. Using both of these tables, the ETL
process can load data from the source systems into the target structures.
The translation tables contain one or more rows for each source value and map the
value to a matching row in the reference table. For example, the SOURCE column in
FILE X on System X can contain O, S or W. The data steward would be responsible
for entering in the Translation table the following values:
Source Value Code Translation
O OFFICE
S STORE
W WAREHSE
These values are used by the data integration process to correctly load the target.
Other source systems that maintain a similar field may use a two-letter abbreviation
like 'OF', 'ST', and 'WH'. The data steward would make the following entries into the
translation table to maintain consistency across systems:
Source Value Code Translation
OF OFFICE
ST STORE
WH WAREHSE
The data stewards are also responsible for maintaining the Reference table that
translates the Codes into descriptions. The ETL process uses the Reference table to
populate the following values into the target:
Code Translation Code Description
OFFICE Office
STORE Retail Store
WAREHSE Distribution Warehouse
Error handling results when the data steward enters incorrect information for these
mappings and needs to correct them after data has been loaded. Correcting the above
example could be complex (e.g., if the data steward entered ST as translating to
OFFICE by mistake). The only way to determine which rows should be changed is to
restore and reload source data from the first time the mistake was entered. Processes
should be built to handle these types of situations, including correction of the entire
target data architecture.
Dimensional Data
New entities in dimensional data present a more complex issue. New entities in the
target may include Locations and Products, at a minimum. Dimensional data uses the
same concept of translation as Reference tables. These translation tables map the
source system value to the target value. For location, this is straightforward, but over
time, products may have multiple source system values that map to the same product
in the target. (Other similar translation issues may also exist, but Products serves as a
good example for error handling.)
There are two possible methods for loading new dimensional entities. Either require the
data steward to enter the translation data before allowing the dimensional data into the
target, or create the translation data through the ETL process and force the data
steward to review it. The first option requires the data steward to create the translation
for new entities, while the second lets the ETL process create the translation, but marks
the record as Pending Verification until the data steward reviews it and changes the
status to Verified before any facts that reference it can be loaded.
When the dimensional value is left as Pending Verification, however, facts may be
rejected or allocated to dummy values. This requires the data stewards to review the
status of new values on a daily basis. A potential solution to this issue is to generate an
e-mail each night if there are any translation table entries pending verification. The
data steward then opens a report that lists them.
A problem specific to Product is that when it is created as new, it is really just a
changed SKU number. This causes additional fact rows to be created, which produces
an inaccurate view of the product when reporting. When this is fixed, the fact rows for
the various SKU numbers need to be merged and the original rows deleted. Profiles
would also have to be merged, requiring manual intervention.
The situation is more complicated when the opposite condition occurs (i.e., two
products are mapped to the same product, but really represent two different products).
In this case, it is necessary to restore the source information for all loads since the
error was introduced. Affected records from the target should be deleted and then
reloaded from the restore to correctly split the data. Facts should be split to allocate the
information correctly and dimensions split to generate correct Profile information.
Manual Updates
Over time, any system is likely to encounter errors that are not correctable using
source systems. A method needs to be established for manually entering fixed data and
applying it correctly to the entire target data architecture, including beginning and
ending effective dates. These dates are useful for both Profile and Date Event fixes.
Further, a log of these fixes should be maintained to enable identifying the source of
the fixes as manual rather than part of the normal load process.
Multiple Sources
The data stewards are also involved when multiple sources exist for the same data. This
occurs when two sources contain subsets of the required information. For example, one
system may contain Warehouse and Store information while another contains Store and
Hub information. Because they share Store information, it is difficult to decide which
source contains the correct information.
When this happens, both sources have the ability to update the same row in the target.
If both sources are allowed to update the shared information, data accuracy and Profile
problems are likely to occur. If we update the shared information on only one source
system, the two systems then contain different information. If the changed system is
loaded into the target, it creates a new Profile indicating the information changed.
When the second system is loaded, it compares its old unchanged value to the new
Profile, assumes a change occurred and creates another new Profile with the old,
unchanged value. If the two systems remain different, the process causes two Profiles
to be loaded every day until the two source systems are synchronized with the same
information.
To avoid this type of situation, the business analysts and developers need to designate,
at a field level, the source that should be considered primary for the field. Then, only if
the field changes on the primary source would it be changed. While this sounds simple,
it requires complex logic when creating Profiles, because multiple sources can provide
information toward the one Profile record created for that day.
One solution to this problem is to develop a system of record for all sources. This allows
developers to pull the information from the system of record, knowing that there are no
conflicts for multiple sources. Another solution is to indicate, at the field level, a
primary source where information can be shared from multiple sources. Developers can
use the field level information to update only the fields that are marked as primary.
However, this requires additional effort by the data stewards to mark the correct source
fields as primary and by the data integration team to customize the load process.
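One way to record the field-level designation is a small control table that the load
process can consult. The following sketch is purely illustrative; the table and column
names are assumptions:

    -- Illustrative control table: which source system is primary for each shared field.
    CREATE TABLE FIELD_SOURCE_PRIORITY (
        TARGET_TABLE   VARCHAR2(30),
        TARGET_COLUMN  VARCHAR2(30),
        PRIMARY_SOURCE VARCHAR2(30),   -- e.g. 'WAREHOUSE_SYS' or 'HUB_SYS'
        CONSTRAINT PK_FIELD_SRC PRIMARY KEY (TARGET_TABLE, TARGET_COLUMN)
    );

The load for a given source system would then update only the target columns for which
that system is marked as primary.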


Error Handling Techniques using PowerCenter 7 (PC7)
and PowerCenter Metadata Reporter (PCMR)
Challenge
Implementing an efficient strategy to identify different types of errors in the ETL
process, correct the errors, and reprocess the corrected data.
Description
Identifying errors and creating an error handling strategy is an essential part of a data
warehousing project. The errors in an ETL process can be broadly categorized into two
types: data errors in the load process, which are defined by the standards of
acceptable data quality; and process errors, which are driven by the stability of the
process itself.
The first step in implementing an error handling strategy is to understand and define
the error handling requirement. Consider the following questions:
What tools and methods can help in detecting all the possible errors?
What tools and methods can help in correcting the errors?
What is the best way to reconcile data across multiple systems?
Where and how will the errors be stored? (i.e., relational tables or flat files)
A robust error handling strategy can be implemented using PowerCenter's built-in error
handling capabilities along with the PowerCenter Metadata Reporter (PCMR) as follows:
Process Errors: Configure an email task to notify the PowerCenter Administrator
immediately of any process failures.
Data Errors: Set up the ETL process to:
o Use the Row Error Logging feature in PowerCenter to capture data errors
in the PowerCenter error tables for analysis, correction, and reprocessing.
o Set up PCMR alerts to notify the PowerCenter Administrator in the event of
any rejected rows.
o Set up customized PCMR reports and dashboards at the project level to
provide information on failed sessions, sessions with failed rows, load
time, etc.
Configuring an Email Task to Handle Process Failures

Configure all workflows to send an email to the PowerCenter Administrator, or any
other designated recipient, in the event of a session failure. Create a reusable email
task and use it in the On Failure Email property settings in the Components tab of the
session, as shown in the following figure:

When you configure the subject and body of a post-session email, use email variables
to include information about the session run, such as session name, mapping name,
status, total number of records loaded, and total number of records rejected. The
following table lists all the available email variables:

Email Variables for Post-Session Email

%s  Session name.
%e  Session status.
%b  Session start time.
%c  Session completion time.
%i  Session elapsed time (session completion time - session start time).
%l  Total rows loaded.
%r  Total rows rejected.
%t  Source and target table details, including read throughput in bytes per second and
    write throughput in rows per second. The PowerCenter Server includes all information
    displayed in the session detail dialog box.
%m  Name of the mapping used in the session.
%n  Name of the folder containing the session.
%d  Name of the repository containing the session.
%g  Attach the session log to the message.
%a<filename>  Attach the named file. The file must be local to the PowerCenter Server.
    The following are valid file names: %a<c:\data\sales.txt> or
    %a</users/john/data/sales.txt>. Note: The file name cannot include the greater than
    character (>) or a line break.

Note: The PowerCenter Server ignores %a, %g, or %t when you include them in the
email subject. Include these variables in the email message only.
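For example, a failure notification might use a subject and body along the following
lines; the wording is illustrative and uses only the variables listed above:

    Subject: Session %s failed with status %e
    Body:    Session %s (mapping %m, folder %n) completed at %c with status %e.
             Rows loaded: %l   Rows rejected: %r
             %g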
Configuring Row Error Logging in PowerCenter
PowerCenter provides you with a set of four centralized error tables into which all data
errors can be logged. Using these tables to capture data errors greatly reduces the time
and effort required to implement an error handling strategy when compared with a
custom error handling solution.
When you configure a session, you can choose to log row errors in this central location.
When a row error occurs, the PowerCenter Server logs error information that allows you
to determine the cause and source of the error. The PowerCenter Server logs
information such as source name, row ID, current row data, transformation, timestamp,
error code, error message, repository name, folder name, session name, and mapping
information. This error metadata is logged for all row level errors, including database
errors, transformation errors, and errors raised through the ERROR() function, such as
business rule violations.
Logging row errors into relational tables rather than flat files enables you to report on
and fix the errors easily. When you enable error logging and choose the Relational
Database Error Log Type, the PowerCenter Server offers you the following features:
Generates the following tables to help you track row errors:
o PMERR_DATA. Stores data and metadata about a transformation row
error and its corresponding source row.
o PMERR_MSG. Stores metadata about an error and the error message.
o PMERR_SESS. Stores metadata about the session.
o PMERR_TRANS. Stores metadata about the source and transformation
ports, such as name and datatype, when a transformation error occurs.
Appends error data to the same tables cumulatively, if they already
exist, on subsequent runs of the session.
Allows you to specify a prefix for the error tables. For instance, if
you want all of your EDW session errors to go to one set of error
tables, you can specify the prefix EDW_.
Allows you to collect row errors from multiple sessions in a
centralized set of four error tables. To do this, specify the
same error log table name prefix for all sessions.

Example:

In the following figure, the session s_m_Load_Customer loads Customer Data into the
EDW Customer table. The Customer Table in EDW has the following structure:
CUSTOMER_ID NOT NULL NUMBER (PRIMARY KEY)
CUSTOMER_NAME NULL VARCHAR2(30)
CUSTOMER_STATUS NULL VARCHAR2(10)
There is a primary key constraint on the column CUSTOMER_ID.
To take advantage of PowerCenter's built-in error handling features, you would set the
session properties as shown below:

The session property Error Log Type is set to Relational Database, and the Error Log DB
Connection and Table Name Prefix values are set accordingly.
When the PowerCenter server detects any rejected rows because of Primary Key
Constraint violation, it writes information into the Error Tables as shown below:
EDW_PMERR_DATA:

REPOSITORY_GID | WORKFLOW_RUN_ID | WORKLET_RUN_ID | SESS_INST_ID | TRANS_MAPPLET_INST | TRANS_NAME | TRANS_GROUP
37379c74-f4b5-4dc7-a927-3b38c9ec09ca | 8 | 0 | 3 | N/A | Customer_Table | Input
37379c74-f4b5-4dc7-a927-3b38c9ec09ca | 8 | 0 | 3 | N/A | Customer_Table | Input
37379c74-f4b5-4dc7-a927-3b38c9ec09ca | 8 | 0 | 3 | N/A | Customer_Table | Input

EDW_PMERR_MSG:

REPOSITORY_GID | WORKFLOW_RUN_ID | WORKLET_RUN_ID | SESS_INST_ID | SESS_START_TIME | SES_UTC_TIME | REPOSITORY_NAME | FOLDER_NAME
37379c74-f4b5-4dc7-a927-3b38c9ec09ca | 6 | 0 | 3 | 9/15/2004 18:31 | 9/15/2004 18:31 | pc711 | Folder1
37379c74-f4b5-4dc7-a927-3b38c9ec09ca | 7 | 0 | 3 | 9/15/2004 18:33 | 9/15/2004 18:33 | pc711 | Folder1
37379c74-f4b5-4dc7-a927-3b38c9ec09ca | 8 | 0 | 3 | 9/15/2004 18:34 | 9/15/2004 18:34 | pc711 | Folder1

EDW_PMERR_SESS:

REPOSITORY_GID | WORKFLOW_RUN_ID | WORKLET_RUN_ID | SESS_INST_ID | SESS_START_TIME | SES_UTC_TIME | REPOSITORY_NAME | FOLDER_NAME
37379c74-f4b5-4dc7-a927-3b38c9ec09ca | 6 | 0 | 3 | 9/15/2004 18:31 | 9/15/2004 18:31 | pc711 | Folder1
37379c74-f4b5-4dc7-a927-3b38c9ec09ca | 7 | 0 | 3 | 9/15/2004 18:33 | 9/15/2004 18:33 | pc711 | Folder1
37379c74-f4b5-4dc7-a927-3b38c9ec09ca | 8 | 0 | 3 | 9/15/2004 18:34 | 9/15/2004 18:34 | pc711 | Folder1

EDW_PMERR_TRANS:

REPOSITORY_GID | WORKFLOW_RUN_ID | WORKLET_RUN_ID | SESS_INST_ID | TRANS_MAPPLET_INST | TRANS_NAME | TRANS_GROUP | TRANS_ATTR
37379c74-f4b5-4dc7-a927-3b38c9ec09ca | 8 | 0 | 3 | N/A | Customer_Table | Input | Customer_Id:3, Customer_Name:12, Customer_Status:12

By looking at the workflow run ID and the other fields, you can easily analyze the errors
and reprocess the affected rows after they have been fixed.
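Because the errors are stored relationally, they can also be examined with ordinary SQL.
The following query is a minimal sketch that lists the logged errors for one workflow
run; it assumes the EDW_ prefix used above, joins on only a subset of the available
keys, and the ERROR_MSG column name is an assumption that may differ by PowerCenter
version:

    -- List row errors for one workflow run from the centralized error tables.
    SELECT d.TRANS_NAME,
           d.TRANS_GROUP,
           m.ERROR_MSG               -- column name is an assumption
    FROM   EDW_PMERR_DATA d
           JOIN EDW_PMERR_MSG m
             ON  m.REPOSITORY_GID  = d.REPOSITORY_GID
             AND m.WORKFLOW_RUN_ID = d.WORKFLOW_RUN_ID
             AND m.SESS_INST_ID    = d.SESS_INST_ID
    WHERE  d.WORKFLOW_RUN_ID = 8;    -- run identifier taken from the rows above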
Error Detection and Notification using PCMR
Informatica provides PowerCenter Metadata Reporter (PCMR) with every PowerCenter
license. The PCMR uses Informatica's powerful business intelligence tool,
PowerAnalyzer, to provide insight into the PowerCenter repository metadata.
You can use the Operations Dashboard of the PCMR as one central location to gain
insight into production environment ETL activities. In addition, the following capabilities
of the PCMR are recommended best practices:
Configure PCMR alerts to send an email or a pager message to the PowerCenter
Administrator whenever there is an entry made into the error tables
PMERR_DATA or PMERR_TRANS.
Configure reports and dashboards using the PCMR to provide detailed session run
information grouped by projects/PowerCenter folders for easy analysis.
Configure reports in PCMR to provide detailed information of the row level errors
for each session. This can be accomplished by using the four error tables as
sources of data for the reports.
Error Correction and Reprocessing
The method of error correction depends on the type of error that occurred. Here are a
few things that you should consider during error correction:
The owner of the data should always fix the data errors. For example, if the
source data is coming from an external system, then you should send the errors
back to the source system to be fixed.
In some situations, a simple re-execution of the session will reprocess the data.
You may be able to modify the SQL or some other session property to make
sure that no duplicate data is processed during the re-run of the session and
that all data is processed correctly.
In some situations, partial data that has been loaded into the target systems
should be backed out in order to avoid duplicate processing of rows.
o Having a field in every target table, such as a BATCH_ID field, to identify
each unique run of the session can help greatly in the process of backing
out partial loads, although sometimes you may need to design a special
mapping to achieve this (see the sketch after this list).

Lastly, errors can also be corrected through a manual SQL load of the data. If the
volume of errors is low, the rejected data can be exported from the PCMR error
reports to Microsoft Excel or CSV format and corrected in a spreadsheet. The
corrected data can then be inserted into the target table manually using a SQL
statement.
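As a minimal sketch of the BATCH_ID approach mentioned above (the target table is the
EDW Customer table from the earlier example; the BATCH_ID column and the values shown
are illustrative assumptions):

    -- Back out the rows written by the failed run; the corrected data can then be
    -- reloaded by re-running the session or inserted manually.
    DELETE FROM EDW_CUSTOMER
    WHERE  BATCH_ID = 1025;   -- identifier of the failed run

    -- Example of a manual insert of a corrected row.
    INSERT INTO EDW_CUSTOMER (CUSTOMER_ID, CUSTOMER_NAME, CUSTOMER_STATUS, BATCH_ID)
    VALUES (4711, 'ACME Corp', 'ACTIVE', 1026);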
Any approach to correct erroneous data should be precisely documented and followed
as a standard.
If data errors occur frequently, reprocessing can be automated by designing a
special mapping or session to correct the errors and load the corrected data into the
ODS or staging area.
Data Reconciliation using PowerAnalyzer
Business users often like to see certain metrics matching from one system to another
(e.g., source system to ODS, ODS to targets, etc.) to ascertain that the data has been
processed accurately. This is frequently accomplished by writing tedious queries,
comparing two separately produced reports, or using constructs such as DBLinks.
By upgrading the PCMR from a limited-use license that can source the PowerCenter
repository metadata only to a full-use PowerAnalyzer license that can source your
company's data (e.g., source systems, staging areas, ODS, data warehouse, and data
marts), PowerAnalyzer provides a reliable and reusable way to accomplish data
reconciliation. Using PowerAnalyzer's reporting capabilities, you can select data from
various data sources such as ODS, data marts and data warehouses to compare key
reconciliation metrics and numbers through aggregate reports. You can further
schedule the reports to run automatically every time the relevant PowerCenter sessions
complete, and set up alerts to notify the appropriate business or technical users in case
of any discrepancies.
For example, a report can be created to ensure that the same number of customers
exists in the ODS as well as in the data warehouse and/or any downstream data marts.
The reconciliation reports should be relevant to a business user by comparing key
metrics (e.g., customer counts, aggregated financial metrics, etc.) across data silos.
Such reconciliation reports can be run automatically after PowerCenter loads the data,
or they can be run on demand by technical or business users. This process allows users
to verify the accuracy of data and build confidence in the data warehouse solution.
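Although such a comparison is normally delivered as a PowerAnalyzer report, the
underlying check amounts to a query like the following minimal sketch; ODS_CUSTOMER
and DW_CUSTOMER_DIM are illustrative table names:

    -- Compare a key reconciliation metric (customer count) across data silos.
    SELECT 'ODS'            AS SOURCE_SYSTEM, COUNT(*) AS CUSTOMER_COUNT FROM ODS_CUSTOMER
    UNION ALL
    SELECT 'Data Warehouse' AS SOURCE_SYSTEM, COUNT(*) AS CUSTOMER_COUNT FROM DW_CUSTOMER_DIM;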


Error Management in a Data Warehouse Environment
Challenge
A key requirement for any successful data warehouse or data integration project is that
it attain credibility within the user community. At the same time, it is imperative that
the warehouse be as up-to-date as possible: the more recent the information derived
from it, the more relevant it is to the business operations of the organization, and the
better the opportunity to gain an advantage over the competition.
Transactional systems can manage to function even with a certain amount of error
since the impact of an individual transaction (in error) has a limited effect on the
business figures as a whole, and corrections can be applied to erroneous data after the
event (i.e., after the error has been identified). In data warehouse systems, however,
any systematic error (e.g., for a particular load instance) not only affects a larger
number of data items, but may potentially distort key reporting metrics. Such data
cannot be left in the warehouse "until someone notices" because business decisions
may be driven by such information.
Therefore, it is important to proactively manage errors, identifying them before, or as,
they occur. If errors occur, it is equally important either to prevent them from getting
to the warehouse at all, or to remove them from the warehouse immediately (i.e.,
before the business tries to use the information in error).
The types of error to consider include:
Source data structures
Sources presented out-of-sequence
Old sources re-presented in error
Incomplete source files
Data-type errors for individual fields
Unrealistic values (e.g., impossible dates)
Business rule breaches
Missing mandatory data
O/S errors
RDBMS errors
These cover both high-level (i.e., related to the process or a load as a whole) and low-
level (i.e., field or column-related errors) concerns.

Description
In an ideal world, when an analysis is complete, you have a precise definition of source
and target data; you can be sure that every source element was populated correctly,
with meaningful values, never missing a value, and fulfilling all relational constraints. At
the same time, source data sets always have a fixed structure, are always available on
time (and in the correct order), and are never corrupted during transfer to the data
warehouse. In addition, the OS and RDBMS never run out of resources, or have
permissions and privileges change.
Realistically, however, the operational applications are rarely able to cope with every
possible business scenario or combination of events; and operational systems crash,
networks fall over, and users may not use the transactional systems in quite the way
they were designed. The operational systems also typically need to allow some
flexibility to allow non-fixed data to be stored (typically as free-text comments). In
every case, there is a risk that the source data does not match what the data
warehouse expects.
Because of the credibility issue, in-error data cannot be allowed to get to the metrics
and measures used by the business managers. If such data does reach the warehouse,
it must be identified as such, and removed immediately (before the current version of
the warehouse can be published). Even better, however, is for such data to be
identified during the load process and prevented from reaching the warehouse at all.
Best of all is for erroneous source data to be identified before a load even begins, so
that no resources are wasted trying to load it.
The principle to follow for correction of errors should be to ensure that the data is
corrected at the source. As soon as any attempt is made to correct errors within the
warehouse, there is a risk that the lineage and provenance of data will be lost. From
that point on, it becomes impossible to guarantee that a metric or data item came from
a specific source via a specific chain of processes. As a by-product, such a principle also
helps to tie both the end-users and those responsible for the source data into the
warehouse process; source data staff understand that their professionalism directly
affects the quality of the reports, and end-users become owners of their data.
As a final consideration, error management complements and overlaps load
management, data quality and key management, and operational processes and
procedures. Load management processes record at a high-level if a load is
unsuccessful; error management records the details of why the failure occurred. Quality
management defines the criteria whereby data can be identified as in error; and error
management identifies the specific error(s), thereby allowing the source data to be
corrected. Operational reporting shows a picture of loads over time, and error
management allows analysis to identify systematic errors, perhaps indicating a failure
in operational procedure.
A key tool for all of these systems is the effective creation and use of metadata. Such
metadata encompasses operational, field-level, loading process, business rule, and
relational areas and is integral to a proactively managed data warehouse.
Error Management Considerations

High-Level Issues
From previous discussion of load management, a number of checks can be performed
before any attempt is made to load a source data set. Without load management in
place, it is unlikely that the warehouse process will be robust enough to satisfy any
end-user requirements, and error correction processing becomes moot (in so far as
nearly all maintenance and development resources will be working full time to
manually correct bad data in the warehouse). The following assumes that you have
implemented load management processes similar to Informatica's best practices.
Process Dependency checks in the load management can identify when a source
data set is missing, duplicates a previous version, or has been presented out of
sequence, and where the previous load failed but has not yet been corrected.
Load management prevents this source data from being loaded. At the same time,
error management processes should record the details of the failed load; noting
the source instance, the load affected, and when and why the load was aborted.
Source file structures can be compared to expected structures stored as metadata,
either from header information or by attempting to read the first data row.
Source table structures can be compared to expectations; typically this can be
done by interrogating the RDBMS catalogue directly (and comparing to the
expected structure held in metadata), or by simply running a describe
command against the table (again comparing to a pre-stored version in
metadata).
Control file totals (for file sources) and row number counts (table sources) are also
used to determine if files have been corrupted or truncated during transfer, or if
tables have no new data in them (suggesting a fault in an operational
application).
In every case, information should be recorded to identify where and when an error
occurred, what sort of error it was, and any other relevant process-level details.
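As a minimal sketch of such a pre-load check (the LOAD_CONTROL and STG_ORDERS tables,
and their columns, are illustrative assumptions):

    -- Compare the control total supplied with a source file to the rows actually staged.
    SELECT c.EXPECTED_ROW_COUNT,
           COUNT(s.ORDER_ID)                                    AS ACTUAL_ROW_COUNT,
           CASE WHEN COUNT(s.ORDER_ID) = c.EXPECTED_ROW_COUNT
                THEN 'OK' ELSE 'ABORT LOAD' END                 AS CHECK_RESULT
    FROM   LOAD_CONTROL c
           LEFT JOIN STG_ORDERS s ON s.LOAD_ID = c.LOAD_ID
    WHERE  c.LOAD_ID = 20040915                                 -- illustrative load instance
    GROUP BY c.EXPECTED_ROW_COUNT;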
Low-Level Issues
Assuming that the load is to be processed normally (i.e., that the high-level checks
have not caused the load to abort), further error management processes need to be
applied to the individual source rows and fields.
Individual source fields can be compared to expected data-types against standard
metadata within the repository, or against additional information added by the
development team. In some instances, this will be enough to abort the rest of the
load, since if the field structure is incorrect, it is much more likely that the
source data set as a whole either cannot be processed at all, or (more
worrisome) will be processed unpredictably.
Data conversion errors can be identified on a field-by-field basis within the body of
a mapping. Built-in error handling can be used to spot failed date conversions,
string-to-number conversions, and missing required data. In rare cases, stored
procedures can be called if a specific conversion fails; however, this cannot be
generally recommended because of the potentially crushing impact on
performance if a particularly error-filled load occurs.
Business rule breaches will then be picked up. It is possible to define allowable
values, or acceptable value ranges, within PowerCenter mappings (if the rules
are few, and it is clear from the mapping metadata that the business rules are
included in the mapping itself). A more flexible approach is to use external
tables to codify the business rules. In this way, only the rules tables need to be
amended if a new business rule needs to be applied; a minimal sketch of such a
rules table appears after this list. Informatica has suggested methods to
implement such a process.
Missing Key/Unknown Key issues have already been defined in their own best
practice document, Key Management in Data Warehousing Solutions, with
suggested management techniques for identifying and handling them. However,
from an error handling perspective, such errors must still be identified and
recorded, even when key management techniques do not formally fail source
rows with key errors. Unless a record is kept of the frequency with which
particular source data fails, it is difficult to realize when there is a systematic
problem in the source systems.
Inter-row errors may also have to be considered. These may occur when a
business process expects a certain hierarchy of events (e.g., a customer query,
followed by a booking request, followed by a confirmation, followed by a
payment). If the events arrive from the source system in the wrong order, or
where key events are missing, it may indicate a major problem with the source
system, or the way in which the source system is being used.
An important principle to follow should be to try to identify all of the errors on a
particular row before halting processing, rather than rejecting the row at the
first instance. This seems to break the rule of not wasting resources trying to
load a source data set if we already know it is in error; however, since the row
will need to be corrected at source, then reprocessed subsequently, it is sensible
to identify all the corrections that need to be made before reloading, rather than
fixing the first, re-running, and then identifying a second error (which halts the
load for a second time).
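As a minimal sketch of the external rules-table approach mentioned earlier in this list
(the table and column names are illustrative assumptions):

    -- Illustrative rules table: one row per field-level business rule.
    CREATE TABLE BUSINESS_RULE (
        RULE_ID        NUMBER        PRIMARY KEY,
        TABLE_NAME     VARCHAR2(30),
        COLUMN_NAME    VARCHAR2(30),
        MIN_VALUE      NUMBER,
        MAX_VALUE      NUMBER,
        ALLOWED_VALUES VARCHAR2(255),   -- e.g. a delimited list of permitted codes
        RULE_DESC      VARCHAR2(255)
    );

    -- A new rule is applied by inserting a row, not by changing a mapping.
    INSERT INTO BUSINESS_RULE
    VALUES (1, 'ORDERS', 'ORDER_QTY', 1, 10000, NULL, 'Quantity must be between 1 and 10,000');

A mapping would then look up the applicable rule row for each field it validates, so
that new rules are applied by amending the rules table rather than the mappings.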
OS and RDBMS Issues
Since best practice means that referential integrity (RI) issues are proactively managed
within the loads, instances where the RDBMS rejects data for referential reasons should
be very rare (i.e., the load should already have identified that reference information is
missing).
However, there is little that can be done to predict more generic RDBMS problems:
changes to schema permissions, running out of temporary disk space, dropped tables
and schemas, invalid indexes, no further tablespace extents available, missing
partitions, and the like.
Similarly, interaction with the OS means that changes in directory structures, file
permissions, disk space, command syntax, and authentication may occur outside of the
data warehouse. Often such changes are driven by Systems Administrators who, from
an operational perspective, are not aware that there will be an impact on the data
warehouse, or are not aware that the data warehouse managers need to be kept up to
speed.
In both of the instances above, the nature of the errors may be such that not only will
they cause a load to fail, but it may be impossible to record the nature of the error at
that point in time. For example, if RDBMS user ids are revoked, it may be impossible to
write a row to an error table if the error process depends on the revoked id; if disk
space runs out during a write to a target table, this may affect all other tables
(including the error tables); if file permissions on a UNIX host are amended, bad files
themselves (or even the log files) may not be able to be written to.

Most of these types of issues can be managed by a proper load management process,
however. Since setting the status of a load to complete should be absolutely the last
step in a given process, any failure before, or including, that point leaves the load in an
incomplete state. Subsequent runs will note this, and enforce correction of the last
load before beginning the new one.
The best practice to manage such OS and RDBMS errors is, therefore, to ensure that
the Operational Administrators and DBAs have proper and working communication with
the data warehouse management to allow proactive control of changes. Administrators
and DBAs should also be available to the data warehouse operators to rapidly explain
and resolve such errors if they occur.
Auto-Correction vs. Manual Correction
Load management and key management best practices (Key Management in Data
Warehousing Solutions) have already defined auto-correcting processes; the former to
allow loads themselves to launch, rollback, and reload without manual intervention, and
the latter to allow RI errors to be managed so that the quantitative quality of the
warehouse data is preserved, and incorrect key values are corrected as soon as the
source system provides the missing data.
We cannot conclude from these two specific techniques, however, that the warehouse
should attempt to change source data as a general principle. Even if this were possible
(which is debatable), such functionality would mean that the absolute link between the
source data and its eventual incorporation into the data warehouse would be lost. As
soon as one of the warehouse metrics was identified as incorrect, unpicking the error
would be impossible, potentially requiring a whole section of the warehouse to be
reloaded entirely from scratch.
In addition, such automatic correction of data might hide the fact that one or other of
the source systems had a generic fault, or more importantly, had acquired a fault
because of on-going development of the transactional applications, or a failure in user
training.
The principle to apply here is to identify the errors in the load, and then alert the source
system users that data should be corrected in the source system itself, ready for the
next load to pick up the right data. This maintains the data lineage, allows source
system errors to be identified and ameliorated in good time, and permits extra training
needs to be identified and managed.
Error Management Techniques
Simple Error Handling Structure


This simple example defines three main sets of information:
The Error_Definition table simply stores descriptions for the various types of
errors, including process-level (e.g., incorrect source file, load started out-of-
sequence), row-level (e.g., missing foreign key, incorrect data-type, conversion
errors), and reconciliation (e.g., incorrect row numbers, incorrect file total etc.).
The Error_Header provides a high-level view on the process, allowing a quick
identification of the frequency of error for particular loads and of the distribution
of error types. It is linked to the load management processes via the
Src_Inst_ID and Proc_Inst_ID, from which other process-level information can
be gathered.
The Error_Detail stores information about actual rows with errors, including how to
identify the specific row that was in error (using the source natural keys and row
number) together with a string of field identifier/value pairs concatenated
together. It is NOT expected that this information will be deconstructed as part
of an automatic correction load, but if necessary this can be pivoted (e.g., using
simple UNIX scripts) to separate out the field/value pairs for subsequent
reporting.
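A minimal DDL sketch of this structure follows. The column names are illustrative
assumptions based on the descriptions above; only Src_Inst_ID and Proc_Inst_ID, which
link to the load management tables, are taken directly from the description:

    -- Illustrative sketch of the simple error handling structure.
    CREATE TABLE ERROR_DEFINITION (
        ERROR_TYPE_ID   NUMBER        PRIMARY KEY,
        ERROR_CATEGORY  VARCHAR2(20),            -- process-level, row-level, reconciliation
        ERROR_DESC      VARCHAR2(255)
    );

    CREATE TABLE ERROR_HEADER (
        ERROR_HDR_ID    NUMBER        PRIMARY KEY,
        SRC_INST_ID     NUMBER,                  -- link to load management
        PROC_INST_ID    NUMBER,                  -- link to load management
        ERROR_TYPE_ID   NUMBER        REFERENCES ERROR_DEFINITION (ERROR_TYPE_ID),
        ERROR_DATE      DATE
    );

    CREATE TABLE ERROR_DETAIL (
        ERROR_HDR_ID    NUMBER        REFERENCES ERROR_HEADER (ERROR_HDR_ID),
        SOURCE_KEY      VARCHAR2(255),           -- natural key of the source row in error
        SOURCE_ROW_NUM  NUMBER,
        FIELD_VALUES    VARCHAR2(2000)           -- concatenated field identifier/value pairs
    );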

Error Management Process Flow
Error management must fit into the load process as a whole, although the
implementation depends on the particular data warehouse. Typically, mapping
templates are created with the necessary objects to interact with the load management
and error management control tables; these are then added to or adapted with the
specific transformations to fulfil each load requirement. In many instances common
transformations are created to perform error description lookups, business rule
validation, and metadata queries; these are then referenced as and when a given data
item within a transformation requires them.
In any case, error management, load management, metadata, and the load itself are
intimately connected; it is the integration of all these approaches that provides the
robust system that is needed to successfully generate the data warehouse. The
following diagram illustrates the integrated process.


Error Management Process Flow
Challenge
Error management must fit into the load process as a whole. The specific
implementation depends on the particular data warehouse requirements.
Error management involves the following three steps:
Error identification
Error retrieval
Error correction
This Best Practice focuses on the process for implementing each of these steps in a
PowerCenter architecture.
Description
A typical error management process leverages the best-of-breed error management
technology available in PowerCenter, such as relational database error logging, email
notification of workflow failures, session error thresholds, PowerCenter Metadata
Reporter (PCMR) reporting capabilities, and data profiling, and integrates these features
with the load process and metadata to provide a seamless load process.
Error Identification
The first step to error management is error identification. Error identification is most
often achieved through enabling referential integrity constraints at the database level
and enabling relational error logging in PowerCenter. This approach ensures that all
row-level, referential integrity errors are identified by the database and captured in the
relational error handling tables in the PowerCenter repository. By enabling relational
error logging, all row-level errors can automatically be written to a centralized set of
four error handling tables.
These four tables store information such as error messages, error data, and source row
data. These tables include PMERR_MSG, PMERR_DATA, PMERR_TRANS, and
PMERR_SESS. Examples of row-level errors include database errors, transformation
errors, and business rule exceptions for which the ERROR() function has been called
within the mapping.

Error Retrieval
The second step to error management is error retrieval. After errors have been
captured in the PowerCenter repository, it is important to make the retrieval of these
errors simple and automated in order to make the error management process as
efficient as possible. The PCMR should be customized to create error retrieval reports to
extract this information from the PowerCenter repository. A typical error report prompts
a user for the folder and workflow name, and returns a report with information such as
the session, error message, and data that caused the error. In this way, the error is
successfully captured in the repository and can be easily retrieved through a PCMR
report, or an email alert that identifies a user when a certain threshold is crossed in a
report (such as number of errors is greater than zero).
Error Correction
The final step in error management is error correction. Since PowerCenter automates
the process of error identification, and PCMR simplifies error retrieval, the error
correction step is also simple. After retrieving an error through the PCMR, the error
report (which contains information such as workflow name, session name, error date,
error message, error data, and source row data) can be easily exported to various file
formats including Microsoft Excel, Adobe PDF, CSV, and others. Upon retrieval of an
error, the error report can be extracted into a supported format and emailed to a
developer or DBA to resolve the issue, or it can be entered into a defect management
tracking tool. The PCMR interface supports emailing a report directly through the web-
based interface to make the process even easier.
For further automation, a report broadcasting rule that emails the error report to a
developer's email inbox can be set up to run on a pre-defined schedule. After the
developer or DBA identifies the condition that caused the error, a fix for the error can
be implemented. Depending on the type and cause of the error, a fix can be as simple
as a re-execution of the mapping, or as complex as a data repair. The exact method of
data correction depends on various factors such as the number of records with errors,
data availability requirements per SLA, and the level of data criticality to the business
unit(s).
Data Profiling Option
For organizations that want to identify data irregularities post-load but don't want to
reject such rows at load time, the PowerCenter Data Profiling option can be an
important part of the error management solution. The PowerCenter Data Profiling
option enables users to create data profiles through a wizard-driven GUI that provides
profile reporting such as orphan record identification, business rule violation, and data
irregularity identification (such as NULL or default values). Just as with the PCMR, the
PowerCenter Data Profiling option comes with a license to use PowerAnalyzer reports
that source the data profile warehouse to deliver data profiling information through an
intuitive BI tool. This is a recommended best practice since error handling reports and
data profile reports can be delivered to users through the same easy-to-use BI tool.
Integrating Error Management, Load Management, and Metadata
Error management, load management, metadata, and the load itself are intimately
connected; it is the integration of all these approaches that provides the robust system
needed to successfully generate the data warehouse. The following diagram illustrates
this integration process.



Creating Inventories of Reusable Objects & Mappings
Challenge
Successfully creating inventories of reusable objects and mappings, including
identifying potential economies of scale in loading multiple sources to the same target.
Description
Reusable Objects
The first step in creating an inventory of reusable objects is to review the business
requirements and look for any common routines/modules that may appear in more
than one data movement. These common routines are excellent candidates for reusable
objects. In PowerCenter, reusable objects can be single transformations (lookups,
filters, etc.), single tasks (command, email, and session), a set of tasks that allow you
to reuse a set of workflow logic in several workflows (worklets), or even a string of
transformations (mapplets).
Evaluate potential reusable objects by two criteria:
Is there enough usage and complexity to warrant the development of a common
object?
Are the data types of the information passing through the reusable object the
same from case to case or is it simply the same high-level steps with different
fields and data?
Common objects are sometimes created just for the sake of creating common
components when in reality, creating and testing the object does not save development
time or future maintenance. For example, if there is a simple calculation like
subtracting a current rate from a budget rate that will be used for two different
mappings, carefully consider whether the effort to create, test, and document the
common object is worthwhile. Often, it is simpler to add the calculation to both
mappings. However, if the calculation were to be performed in a number of mappings,
if it was very difficult, and if all occurrences would be updated following any change or
fix then this would be an ideal case for a reusable object. When you add instances of
a reusable transformation to mappings, you must be careful that changes you make to
the transformation do not invalidate the mapping or generate unexpected data. The
Designer stores each reusable transformation as metadata, separate from any mapping
that uses the transformation.

The second criterion for a reusable object concerns the data that will pass through the
reusable object. Many times developers see a situation where they may perform a
certain type of high-level process (e.g., filter, expression, or update strategy) in two or
more mappings. For example, if you have several fact tables that require a series of
dimension keys, you can create a mapplet containing a series of lookup transformations
to find each dimension key. You can then use the mapplet in each fact table mapping,
rather than recreate the same lookup logic in each mapping. This seems like a great
candidate for a mapplet. However, after performing half of the mapplet work, the
developers may realize that the actual data or ports passing through the high-level
logic are totally different from case to case, thus making the use of a mapplet
impractical. Consider whether there is a practical way to generalize the common logic so
that it can be successfully applied to multiple cases. Remember, when creating a
reusable object, the actual object will be replicated in one to many mappings. Thus, in
each mapping using the mapplet or reusable transformation object, the same size and
number of ports must pass into and out of the mapping/reusable object.
Document the list of the reusable objects that pass these criteria, providing a high-
level description of what each object will accomplish. The detailed design will occur in a
future subtask, but at this point the intent is to identify the number and functionality of
reusable objects that will be built for the project. Keep in mind that it will be impossible
to identify one hundred percent of the reusable objects at this point; the goal here is to
create an inventory of as many as possible, and hopefully the most difficult ones. The
remainder will be discovered while building the data integration processes.
Mappings
A mapping is a set of source and target definitions linked by transformation objects that
define the rules for data transformation. Mappings represent the data flow between
sources and targets. In a simple world, a single source table would populate a single
target table. However, in practice, this is usually not the case. Sometimes multiple
sources of data need to be combined to create a target table, and sometimes a single
source of data creates many target tables. The latter is especially true for mainframe
data sources where COBOL OCCURS statements litter the landscape. In a typical
warehouse or data mart model, each OCCURS statement decomposes to a separate
table.
The goal here is to create an inventory of the mappings needed for the project. For this
exercise, the challenge is to think in individual components of data movement. While
the business may consider a fact table and its three related dimensions as a single
object in the data mart or warehouse, five mappings may be needed to populate the
corresponding star schema with data (i.e., one for each of the dimension tables and two
for the fact table, each from a different source system).
Typically, when creating an inventory of mappings, the focus is on the target tables,
with an assumption that each target table has its own mapping, or sometimes multiple
mappings. While often true, if a single source of data populates multiple tables, this
approach yields multiple mappings. Efficiencies can sometimes be realized by loading
multiple tables from a single source. By simply focusing on the target tables, however,
these efficiencies can be overlooked.
A more comprehensive approach to creating the inventory of mappings is to create a
spreadsheet listing all of the target tables. Create a column with a number next to each
target table. For each of the target tables, in another column, list the source file or
table that will be used to populate the table. In the case of multiple source tables per
target, create two rows for the target, each with the same number, and list the
additional source(s) of data.
The table would look similar to the following:
Number | Target Table  | Source
1      | Customers     | Cust_File
2      | Products      | Items
3      | Customer_Type | Cust_File
4      | Orders_Item   | Tickets
4      | Orders_Item   | Ticket_Items
When completed, the spreadsheet can be sorted either by target table or source table.
Sorting by source table can help determine potential mappings that create multiple
targets.
When using a source to populate multiple tables at once for efficiency, be sure to keep
restartability and reloadability in mind. The mapping will always load two or more target
tables from the source, so there will be no easy way to rerun a single table. In this
example, potentially the Customers table and the Customer_Type tables can be loaded
in the same mapping.
When merging targets into one mapping in this manner, give both targets the same
number. Then, re-sort the spreadsheet by number. For the mappings with multiple
sources or targets, merge the data back into a single row to generate the inventory of
mappings, with each number representing a separate mapping.
The resulting inventory would look similar to the following:
Number | Target Table             | Source
1      | Customers, Customer_Type | Cust_File
2      | Products                 | Items
4      | Orders_Item              | Tickets, Ticket_Items
At this point, it is often helpful to record some additional information about each
mapping to help with planning and maintenance.
First, give each mapping a name. Apply the naming standards generated in 2.2 Design
Development Architecture. These names can then be used to distinguish mappings from
one other and also can be put on the project plan as individual tasks.
Next, determine for the project a threshold for a high, medium, or low number of target
rows. For example, in a warehouse where dimension tables are likely to number in the
thousands and fact tables in the hundred thousands, the following thresholds might
apply:

Low    - 1 to 10,000 rows
Medium - 10,000 to 100,000 rows
High   - 100,000 rows +
Assign a likely row volume (high, medium or low) to each of the mappings based on the
expected volume of data to pass through the mapping. These high level estimates will
help to determine how many mappings are of high volume; these mappings will be the
first candidates for performance tuning.
Add any other columns of information that might be useful to capture about each
mapping, such as a high-level description of the mapping functionality, resource
(developer) assigned, initial estimate, actual completion time, or complexity rating.


Metadata Reporting and Sharing
Challenge
Using Informatica's suite of metadata tools effectively in the design of the end-user
analysis application.
Description
The Informatica tool suite can capture extensive levels of metadata, but the amount of
metadata that is entered depends on the metadata strategy. Detailed information or
metadata comments can be entered for all repository objects (e.g., mappings, sources,
targets, transformations, ports, etc.). Also, all information about column size and scale,
data types, and primary keys is stored in the repository. The decision on how much
metadata to create is often driven by project timelines. While it may be beneficial for a
developer to enter detailed descriptions of each column, expression, variable, etc.,
doing so requires extra time and effort. However, once that information is entered in
the Informatica repository, it can be retrieved at any time using the Metadata Reporter.
Several out-of-the-box reports are available, and customized reports can also be
created to view that information. There are several options for exporting these reports
(e.g., Excel spreadsheet, Adobe .pdf file, etc.). Informatica offers two ways to access
the repository metadata:
Metadata Reporter, which is a web-based application that allows you to run
reports against the repository metadata. This is a very comprehensive tool that
is powered by the functionality of Informatica's BI reporting tool, PowerAnalyzer.
It is included on the PowerCenter CD.
Because Informatica does not support or recommend direct reporting access to the
repository, even for Select Only queries, the second way of repository metadata
reporting is through the use of views written using Metadata Exchange (MX).
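For example, session run statistics can be selected through an MX view rather than from
the underlying repository tables. The sketch below assumes the REP_SESS_LOG view and
commonly documented column names, both of which may vary by PowerCenter version:

    -- Query an MX view instead of the physical repository tables.
    -- View and column names are assumptions; check the MX views for your version.
    SELECT SUBJECT_AREA,          -- repository folder
           SESSION_NAME,
           SUCCESSFUL_ROWS,
           FAILED_ROWS,
           ACTUAL_START
    FROM   REP_SESS_LOG
    ORDER  BY ACTUAL_START DESC;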

Metadata Reporter
The need for the Informatica Metadata Reporter arose from the number of clients
requesting custom and complete metadata reports from their repositories. Metadata
Reporter is based on the PowerAnalyzer and PowerCenter products. It provides
PowerAnalyzer dashboards and metadata reports to help you administer your day-to-
day PowerCenter operations, reports to access every Informatica object stored in the
repository, and even reports to access objects in the PowerAnalyzer repository. The
architecture of the Metadata Reporter is web-based, with an Internet browser front end.

Because Metadata Reporter runs on PowerAnalyzer, you must have PowerAnalyzer
installed and running before you proceed with Metadata Reporter setup.
Metadata Reporter setup includes the following .XML files to be imported from the
PowerCenter CD in the same sequence as they are listed below:
Schemas.xml
Schedule.xml
GlobalVariables_Oracle.xml (This file is database specific, Informatica provides
GlobalVariable files for DB2, SQLServer, Sybase and Teradata. You need to
select the appropriate file based on your PowerCenter repository environment)
Reports.xml
Dashboards.xml
Note: If you have set up a new instance of PowerAnalyzer exclusively for the Metadata
Reporter, you should have no problem importing these files. However, if you are using
an existing instance of PowerAnalyzer which you currently use for some other reporting
purpose, be careful while importing these files. Some of the files (e.g., global variables,
schedules, etc.) may already exist with the same name. You can rename the conflicting
objects.
The following are the folders that are created in PowerAnalyzer when you import the
above-listed files:
PowerAnalyzer Metadata Reporting - contains reports for the PowerAnalyzer repository
itself (e.g., Today's Login, Reports Accessed by Users Today, etc.).
PowerCenter Metadata Reports - contains reports for the PowerCenter repository. To
better organize reports based on their functionality, these reports are further
grouped into the following subfolders:
Configuration Management - contains a set of reports that provide detailed
information on configuration management, including deployment and label
details. This folder contains the following subfolders:
o Deployment
o Label
o Object Version
Operations - contains a set of reports that enable users to analyze operational
statistics including server load, connection usage, run times, load times, number
of runtime errors, etc., for workflows, worklets, and sessions. This folder contains
the following subfolders:
o Session Execution
o Workflow Execution
PowerCenter Objects - contains a set of reports that enable users to identify all
types of PowerCenter objects, their properties, and interdependencies on other
objects within the repository. This folder contains the following subfolders:
o Mappings
o Mapplets
o Metadata Extension
o Server Grids
o Sessions
o Sources
o Target
o Transformations
o Workflows
o Worklets
Security - contains a set of reports that provide detailed information on the users,
groups and their association within the repository.
Informatica recommends retaining this folder organization, adding new folders if
necessary.
The Metadata Reporter provides 44 standard reports which can be customized with the
use of parameters and wildcards. Metadata Reporter is accessible from any computer
with a browser that has access to the web server where the Metadata Reporter is
installed, even without the other Informatica client tools being installed on that
computer. The Metadata Reporter connects to the PowerCenter repository using JDBC
drivers. Be sure the proper JDBC drivers are installed for your database platform.
(Note: You can also use the JDBC-to-ODBC bridge to connect to the repository, e.g.,
jdbc:odbc:<data_source_name>.)
Metadata Reporter is comprehensive. You can run reports on any repository. The
reports provide information about all types of metadata objects.
Metadata Reporter is easily accessible. Because the Metadata Reporter is web-
based, you can generate reports from any machine that has access to the web
server.
The reports in the Metadata Reporter are customizable. The Metadata Reporter
allows you to set parameters for the metadata objects to include in the report.
The Metadata Reporter allows you to go easily from one report to another. The
name of any metadata object that displays on a report links to an associated
report. As you view a report, you can generate reports for objects on which you
need more information.
The following table shows the list of reports provided by the Metadata Reporter, along
with their location and a brief description:
Reports For PowerCenter Repository
Sr No Name Folder Description
1 Deployment
Group
Public Folders>PowerCenter Metadata
Reports>Configuration
Management>Deployment>Deployment
Group
Displays deployment
groups by repository
2 Deployment
Group History
Public Folders>PowerCenter Metadata
Reports>Configuration
Management>Deployment>Deployment
Group History
Displays, by group,
deployment groups and
the dates they were
deployed. It also
displays the source and
target repository
names of the
deployment group for
all deployment dates.
This is a primary report
in an analytic workflow.
3 Labels Public Folders>PowerCenter Metadata Displays labels created

PAGE BP-232 BEST PRACTICES INFORMATICA CONFIDENTIAL
Reports For PowerCenter Repository
Sr No Name Folder Description
Reports>Configuration
Management>Labels>Labels
in the repository for
any versioned object by
repository.
4 All Object
Version History
Public Folders>PowerCenter Metadata
Reports>Configuration
Management>Object Version>All Object
Version History
Displays all versions of
an object by the date
the object is saved in
the repository. This is a
standalone report.
5. Server Load by Day of Week
   Folder: Public Folders>PowerCenter Metadata Reports>Operations>Session Execution>Server Load by Day of Week
   Description: Displays the total number of sessions that ran, and the total session run duration for any day of week in any given month of the year by server by repository. For example, all Mondays in September are represented in one row if that month had 4 Mondays.

6. Session Run Details
   Folder: Public Folders>PowerCenter Metadata Reports>Operations>Session Execution>Session Run Details
   Description: Displays session run details for any start date by repository by folder. This is a primary report in an analytic workflow.

7. Target Table Load Analysis (Last Month)
   Folder: Public Folders>PowerCenter Metadata Reports>Operations>Session Execution>Target Table Load Analysis (Last Month)
   Description: Displays the load statistics for each table for last month by repository by folder. This is a primary report in an analytic workflow.

8. Workflow Run Details
   Folder: Public Folders>PowerCenter Metadata Reports>Operations>Workflow Execution>Workflow Run Details
   Description: Displays the run statistics of all workflows by repository by folder. This is a primary report in an analytic workflow.

9. Worklet Run Details
   Folder: Public Folders>PowerCenter Metadata Reports>Operations>Workflow Execution>Worklet Run Details
   Description: Displays the run statistics of all worklets by repository by folder. This is a primary report in an analytic workflow.

10. Mapping List
   Folder: Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mappings>Mapping List
   Description: Displays mappings by repository and folder. It also displays properties of the mapping such as the number of sources used in a mapping, the number of transformations, and the number of targets. This is a primary report in an analytic workflow.

11. Mapping Lookup Transformations
   Folder: Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mappings>Mapping Lookup Transformations
   Description: Displays Lookup transformations used in a mapping by repository and folder. This report is a standalone report and also the first node in the analytic workflow associated with the Mapping List primary report.

12. Mapping Shortcuts
   Folder: Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mappings>Mapping Shortcuts
   Description: Displays mappings defined as a shortcut by repository and folder.

13. Source to Target Dependency
   Folder: Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mappings>Source to Target Dependency
   Description: Displays the data flow from the source to the target by repository and folder. The report lists all the source and target ports, the mappings in which the ports are connected, and the transformation expression that shows how data for the target port is derived.

14. Mapplet List
   Folder: Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mapplets>Mapplet List
   Description: Displays mapplets available by repository and folder. It displays properties of the mapplet such as the number of sources used in a mapplet, the number of transformations, or the number of targets. This is a primary report in an analytic workflow.

15. Mapplet Lookup Transformations
   Folder: Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mapplets>Mapplet Lookup Transformations
   Description: Displays all Lookup transformations used in a mapplet by folder and repository. This report is a standalone report and also the first node in the analytic workflow associated with the Mapplet List primary report.

16. Mapplet Shortcuts
   Folder: Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mapplets>Mapplet Shortcuts
   Description: Displays mapplets defined as a shortcut by repository and folder.

17. Unused Mapplets in Mappings
   Folder: Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mapplets>Unused Mapplets in Mappings
   Description: Displays mapplets defined in a folder but not used in any mapping in that folder.

18. Metadata Extensions Usage
   Folder: Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Metadata Extensions>Metadata Extensions Usage
   Description: Displays, by repository by folder, reusable metadata extensions used by any object. Also displays the counts of all objects using that metadata extension.

19. Server Grid List
   Folder: Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Server Grid>Server Grid List
   Description: Displays all server grids and servers associated with each grid. Information includes host name, port number, and internet protocol address of the servers.

20. Session List
   Folder: Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Sessions>Session List
   Description: Displays all sessions and their properties by repository by folder. This is a primary report in an analytic workflow.

21. Source List
   Folder: Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Sources>Source List
   Description: Displays relational and non-relational sources by repository and folder. It also shows the source properties. This report is a primary report in an analytic workflow.

22. Source Shortcuts
   Folder: Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Sources>Source Shortcuts
   Description: Displays sources that are defined as shortcuts by repository and folder.

23. Target List
   Folder: Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Targets>Target List
   Description: Displays relational and non-relational targets available by repository and folder. It also displays the target properties. This is a primary report in an analytic workflow.

24. Target Shortcuts
   Folder: Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Targets>Target Shortcuts
   Description: Displays targets that are defined as shortcuts by repository and folder.

25. Transformation List
   Folder: Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Transformations>Transformation List
   Description: Displays transformations defined by repository and folder. This is a primary report in an analytic workflow.

26. Transformation Shortcuts
   Folder: Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Transformations>Transformation Shortcuts
   Description: Displays transformations that are defined as shortcuts by repository and folder.

27. Scheduler (Reusable) List
   Folder: Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Workflows>Scheduler (Reusable) List
   Description: Displays all the reusable schedulers defined in the repository and their description and properties by repository by folder. This is a primary report in an analytic workflow.

28. Workflow List
   Folder: Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Workflows>Workflow List
   Description: Displays workflows and workflow properties by repository by folder. This report is a primary report in an analytic workflow.

29. Worklet List
   Folder: Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Worklets>Worklet List
   Description: Displays worklets and worklet properties by repository by folder. This is a primary report in an analytic workflow.

30. Users By Group
   Folder: Public Folders>PowerCenter Metadata Reports>Security>Users By Group
   Description: Displays users by repository and group.
Reports For PowerAnalyzer Repository

1. Bottom 10 Least Accessed Reports this Year
   Folder: Public Folders>PowerAnalyzer Metadata Reporting>Bottom 10 Least Accessed Reports this Year
   Description: Displays the ten least accessed reports for the current year. It has an analytic workflow that provides access details such as user name and access time.

2. Report Activity Details
   Folder: Public Folders>PowerAnalyzer Metadata Reporting>Report Activity Details
   Description: Part of the analytic workflows "Top 10 Most Accessed Reports This Year", "Bottom 10 Least Accessed Reports this Year" and "Usage by Login (Month To Date)".

3. Report Activity Details for Current Month
   Folder: Public Folders>PowerAnalyzer Metadata Reporting>Report Activity Details for Current Month
   Description: Provides information about reports accessed in the current month until current date.

4. Report Refresh Schedule
   Folder: Public Folders>PowerAnalyzer Metadata Reporting>Report Refresh Schedule
   Description: Provides information about the next scheduled update for scheduled reports. It can be used to decide schedule timing for various reports for optimum system performance.

5. Reports Accessed by Users Today
   Folder: Public Folders>PowerAnalyzer Metadata Reporting>Reports Accessed by Users Today
   Description: Part of the analytic workflow for "Today's Logins". It provides detailed information on the reports accessed by users today. This can be used independently to get comprehensive information about today's report activity details.

6. Todays Logins
   Folder: Public Folders>PowerAnalyzer Metadata Reporting>Todays Logins
   Description: Provides the login count and average login duration for users who logged in today.

7. Todays Report Usage by Hour
   Folder: Public Folders>PowerAnalyzer Metadata Reporting>Todays Report Usage by Hour
   Description: Provides information about the number of reports accessed today for each hour. The analytic workflow attached to it provides more details on the reports accessed and users who accessed them during the selected hour.

8. Top 10 Most Accessed Reports this Year
   Folder: Public Folders>PowerAnalyzer Metadata Reporting>Top 10 Most Accessed Reports this Year
   Description: Shows the ten most accessed reports for the current year. It has an analytic workflow that provides access details such as user name and access time.

9. Top 5 Logins (Month To Date)
   Folder: Public Folders>PowerAnalyzer Metadata Reporting>Top 5 Logins (Month To Date)
   Description: Provides information about users and their corresponding login count for the current month to date. The analytic workflow attached to it provides more details about the reports accessed by a selected user.

10. Top 5 Longest Running On-Demand Reports (Month To Date)
   Folder: Public Folders>PowerAnalyzer Metadata Reporting>Top 5 Longest Running On-Demand Reports (Month To Date)
   Description: Shows the five longest running on-demand reports for the current month to date. It displays the average total response time, average DB response time, and the average PowerAnalyzer response time (all in seconds) for each report shown.

11. Top 5 Longest Running Scheduled Reports (Month To Date)
   Folder: Public Folders>PowerAnalyzer Metadata Reporting>Top 5 Longest Running Scheduled Reports (Month To Date)
   Description: Shows the five longest running scheduled reports for the current month to date. It displays the average response time (in seconds) for each report shown.

12. Total Schedule Errors for Today
   Folder: Public Folders>PowerAnalyzer Metadata Reporting>Total Schedule Errors for Today
   Description: Provides the number of errors encountered during execution of reports attached to schedules. The analytic workflow "Scheduled Report Error Details for Today" is attached to it.

13. User Logins (Month To Date)
   Folder: Public Folders>PowerAnalyzer Metadata Reporting>User Logins (Month To Date)
   Description: Provides information about users and their corresponding login count for the current month to date. The analytic workflow attached to it provides more details about the reports accessed by a selected user.

14. Users Who Have Never Logged On
   Folder: Public Folders>PowerAnalyzer Metadata Reporting>Users Who Have Never Logged On
   Description: Provides information about users who exist in the repository but have never logged in. This information can be used to make administrative decisions about disabling accounts.
Customizing a Report or Creating New Reports
Once you select the report, you can customize it by setting the parameter values
and/or creating new attributes or metrics. PowerAnalyzer includes simple steps to
create new reports or modify existing ones. Adding or modifying filters offers
tremendous reporting flexibility. Additionally, you can set up report templates and
export them as Excel files, which can be refreshed as necessary. For more information
on the attributes, metrics, and schemas included with the Metadata Reporter, consult
the product documentation.
Wildcards
The Metadata Reporter supports two wildcard characters:
Percent symbol (%) - represents any number of characters and spaces.
Underscore (_) - represents one character or space.
You can use wildcards in any number and combination in the same parameter. Leaving
a parameter blank returns all values and is the same as using %. The following
examples show how you can use the wildcards to set parameters.
Suppose you have the following values available to select:
items, items_in_promotions, order_items, promotions
The following list shows the return values for some wildcard combinations you can use:
Wildcard Combination   Return Values
%                      items, items_in_promotions, order_items, promotions
<blank>                items, items_in_promotions, order_items, promotions
%items                 items, order_items
item_                  items
item%                  items, items_in_promotions
___m%                  items, items_in_promotions, promotions
%pr_mo%                items_in_promotions, promotions

A printout of the mapping object flow is also useful for clarifying how objects are
connected. To produce such a printout, arrange the mapping in Designer so the full
mapping appears on the screen, and then use Alt+PrtSc to copy the active window to
the clipboard. Use Ctrl+V to paste the copy into a Word document.
For a detailed description of how to run these reports, consult the Metadata Reporter
Guide included in the PowerCenter documentation.
Security Awareness for Metadata Reporter
Metadata Reporter uses PowerAnalyzer for reporting out of the PowerCenter
/PowerAnalyzer repository. PowerAnalyzer has a robust security mechanism that is
inherited by Metadata Reporter. You can establish groups, roles, and/or privileges for
users based on their profiles. Since the information in the PowerCenter repository does not
change often after it goes to production, the Administrator can create some reports and
export them to files that can be distributed to the user community. If the number of
Metadata Reporter users is limited, you can implement security using report filters
or the data restriction feature. For example, if a user in the PowerCenter repository has access
to certain folders, you can create a filter for those folders and apply it to the user's
profile. For more information on the ways in which you can implement security in
PowerAnalyzer, refer to the PowerAnalyzer documentation.
Metadata Exchange: the Second Generation (MX2)
The MX architecture was intended primarily for BI vendors who wanted to create a
PowerCenter-based data warehouse and display the warehouse metadata through their
own products. The result was a set of relational views that encapsulated the underlying
repository tables while exposing the metadata in several categories that were more
suitable for external parties. Today, Informatica and several key vendors, including
Brio, Business Objects, Cognos, and MicroStrategy, are effectively using the MX views to
report and query the Informatica metadata.
Informatica currently supports the second generation of Metadata Exchange called MX2.
Although the overall motivation for creating the second generation of MX remains
consistent with the original intent, the requirements and objectives of MX2 supersede
those of MX.
The primary requirements and features of MX2 are:
Incorporation of object technology in a COM-based API. Although SQL provides a
powerful mechanism for accessing and manipulating records of data in a relational
paradigm, it is not suitable for procedural programming tasks that can be achieved by
C, C++, Java, or Visual Basic. Furthermore, the increasing popularity and use of object-
oriented software tools require interfaces that can fully take advantage of the object
technology. MX2 is implemented in C++ and offers an advanced object-based API for
accessing and manipulating the PowerCenter Repository from various programming
languages.
Self-contained Software Development Kit (SDK). One of the key advantages of MX
views is that they are part of the repository database and thus can be used
independent of any of the Informatica software products. The same requirement also
holds for MX2, thus leading to the development of a self-contained API Software
Development Kit that can be used independently of the client or server products.
Extensive metadata content, especially multidimensional models for OLAP. A
number of BI tools and upstream data warehouse modeling tools require complex
multidimensional metadata, such as hierarchies, levels, and various relationships. This
type of metadata was specifically designed and implemented in the repository to
accommodate the needs of the Informatica partners by means of the new MX2
interfaces.
Ability to write (push) metadata into the repository. Because of the limitations
associated with relational views, MX could not be used for writing or updating metadata
in the Informatica repository. As a result, such tasks could only be accomplished by
directly manipulating the repository's relational tables. The MX2 interfaces provide
metadata write capabilities along with the appropriate verification and validation
features to ensure the integrity of the metadata in the repository.
Complete encapsulation of the underlying repository organization by means of
an API. One of the main challenges with MX views and the interfaces that access the
repository tables is that they are directly exposed to any schema changes of the
underlying repository database. As a result, maintaining the MX views and direct
interfaces requires a major effort with every major upgrade of the repository. MX2
alleviates this problem by offering a set of object-based APIs that are abstracted away
from the details of the underlying relational tables, thus providing an easier mechanism
for managing schema evolution.
Integration with third-party tools. MX2 offers the object-based interfaces needed to
develop more sophisticated procedural programs that can tightly integrate the
repository with the third-party data warehouse modeling and query/reporting tools.
Synchronization of metadata based on changes from up-stream and down-
stream tools. Given that metadata is likely to reside in various databases and files in a
distributed software environment, synchronizing changes and updates ensures the
validity and integrity of the metadata. The object-based technology used in MX2
provides the infrastructure needed to implement automatic metadata synchronization
and change propagation across different tools that access the PowerCenter Repository.
Interoperability with other COM-based programs and repository interfaces.
MX2 interfaces comply with Microsoft's Component Object Model (COM) interoperability
protocol. Therefore, any existing or future program that is COM-compliant can
seamlessly interface with the PowerCenter Repository by means of MX2.


Repository Tables & Metadata Management
Challenge
Maintaining the repository to support regular backups, quick response times, and metadata queries for metadata reports.
Description
Regular actions such as backups, testing backup and restore procedures, and deleting unwanted information from the repository help keep the repository performing well.
Managing Repository
The PowerCenter Administrator plays a vital role in managing and maintaining the
repository and metadata. The role involves tasks such as securing the repository,
managing the users and roles, maintaining backups, and managing the repository
through such activities as removing unwanted metadata, analyzing tables, and updating
statistics.
Repository backup
Repository backups can be performed using the Repository Server Administration Console client tool or the pmrep command line program. Backups using pmrep can be automated and scheduled to run regularly.


Figure 1 Shell Script to backup repository
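A minimal sketch of such a script is shown below. The repository name, user, password, host, and port are placeholders, and the exact pmrep options can vary between PowerCenter versions, so verify them with the pmrep command line help before use.

#!/bin/sh
# Illustrative repository backup script using pmrep.
# All connection values below are placeholders for your environment.
BACKUP_DIR=/backup/infa_repository
BACKUP_FILE=$BACKUP_DIR/DEV_REPO_`date +%Y%m%d`.rep

# Connect to the repository through the Repository Server, then take the backup.
pmrep connect -r DEV_REPO -n Administrator -x admin_password -h repserver_host -o 5001
pmrep backup -o $BACKUP_FILE

# Compress the backup file to save space.
gzip $BACKUP_FILE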
This shell script can be scheduled to run as a cron job for regular backups. Alternatively, it can be called from PowerCenter via a command task; the command task can be placed in a workflow and scheduled to run daily.


Figure 2 Repository Backup workflow
The following paragraphs describe some useful practices for maintaining backups:
Frequency: Backup frequency depends on the activity in the repository. For production repositories, backup is recommended once a month or prior to a major release. For development repositories, backup is recommended once a week or once a day, depending on the team size.
Backup file sizes: Because backup files can be very large, Informatica recommends
compressing them using a utility such as winzip or gzip.
Storage: For security reasons, Informatica recommends maintaining backups on a
different physical device than the repository itself.

Move backups offline: Review the backups on a regular basis to determine how long
they need to remain online. Any that are not required online should be moved offline,
to tape, as soon as possible.
Restore repository
Although the Repository restore function is used primarily as part of disaster recovery,
it can also be useful for testing the validity of the backup files and for testing the
recovery process on a regular basis. Informatica recommends testing the backup files
and recovery process at least once each quarter. The repository can be restored using
the Repository Server Administration Console client tool or the pmrepagent command line program.
Restore folders
There is no easy way to restore only one particular folder from a backup. First, the backup must be restored into a new repository; then you can use the Repository Manager client tool to copy the entire folder from the restored repository into the target repository.
Remove older versions
Use the purge command to remove older versions of objects from repository. To purge
a specific version of an object, view the history of the object, select the version, and
purge it.
Finding deleted objects and removing them from repository
If a PowerCenter repository is enabled for versioning through the use of the Team Based Development option, objects that have been deleted from the repository are no longer visible in the client tools. To list or view deleted objects, use the find checkouts command in the client tools, or run either a generated or a user-defined query in the Repository Manager.


Figure 3 Query to list DELETED objects
After an object has been deleted from the repository, you cannot create another object with the same name unless the deleted object has been completely removed from the repository. Use the purge command to completely remove deleted objects from the repository. Keep in mind, however, that you must purge all versions of a deleted object to completely remove it from the repository.
Truncating Logs
You can truncate the log information (for sessions and workflows) stored in the repository either by using the Repository Manager or the pmrep command line program. Logs can be truncated for the entire repository or for a particular folder.
Options allow truncating all log entries or selected entries based on date and time.
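As a sketch, truncating all log entries for a single folder from the command line might look like the following; the repository connection values and folder name are placeholders, and the truncatelog options should be confirmed against the pmrep documentation for your release.

# Remove all workflow and session log entries for one folder (values are placeholders).
pmrep connect -r DEV_REPO -n Administrator -x admin_password -h repserver_host -o 5001
pmrep truncatelog -t all -f SALES_DW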


Figure 4 Truncate Log for entire repository

Figure 5 Truncate Log - for a specific folder
Repository Performance
Analyzing the repository tables (or updating their statistics) can help improve repository performance. Because this process should be carried out for all tables in the repository, a script offers the most efficient means. You can then schedule the script to
run using either an external scheduler or a PowerCenter workflow with a command task
to call the script.
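For an Oracle-hosted repository, one possible sketch of such a script uses SQL*Plus to generate and run ANALYZE statements for every repository table (PowerCenter repository tables share the OPB_ prefix); the database user, password, and connect string are placeholders.

#!/bin/sh
# Generate and execute ANALYZE statements for all PowerCenter repository tables (Oracle).
# The repository database user, password, and TNS alias are placeholders.
sqlplus -s repo_user/repo_password@repo_db <<EOF
set heading off feedback off pagesize 0 linesize 200
spool analyze_repo.sql
select 'ANALYZE TABLE ' || table_name || ' COMPUTE STATISTICS;'
  from user_tables
 where table_name like 'OPB_%';
spool off
@analyze_repo.sql
exit
EOF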
Repository Agent and Repository Server performance
Factors such as team size, network, number of objects involved in a specific operation,
number of old locks (on repository objects), etc. may reduce the efficiency of the
repository server (or agent). In such cases, the various causes should be analyzed and
the repository server (or agent) configuration file modified to improve performance.
Managing Metadata
The following paragraphs list the queries that are most often used to report on
PowerCenter metadata. The queries are written for PowerCenter repositories on Oracle
and are based on PowerCenter 6 and PowerCenter 7. Minor changes in the queries may
be required for PowerCenter repositories residing on other databases.
Failed Sessions
The following query lists the failed sessions in the last day. To make it work for the last
n days, replace SYSDATE-1 with SYSDATE - n
SELECT Subject_Area AS Folder,
Session_Name,
Last_Error AS Error_Message,
DECODE (Run_Status_Code,3,'Failed',4,'Stopped',5,'Aborted') AS Status,
Actual_Start AS Start_Time,
Session_TimeStamp
FROM rep_sess_log
WHERE run_status_code != 1
AND TRUNC(Actual_Start) BETWEEN TRUNC(SYSDATE -1) AND TRUNC(SYSDATE)
Long running Sessions
The following query lists long-running sessions (those running longer than 10 minutes) in the last day. To make it work for the last n days, replace SYSDATE-1 with SYSDATE - n.
SELECT Subject_Area AS Folder,
Session_Name,
Successful_Source_Rows AS Source_Rows,
Successful_Rows AS Target_Rows,
Actual_Start AS Start_Time,
Session_TimeStamp
FROM rep_sess_log
WHERE run_status_code = 1
AND TRUNC(Actual_Start) BETWEEN TRUNC(SYSDATE -1) AND TRUNC(SYSDATE)
AND (Session_TimeStamp - Actual_Start) > (10/(24*60)) -- i.e., run time longer than 10 minutes
ORDER BY Session_timeStamp
Invalid Tasks
The following query lists the folder name, task name, version number, and last-saved time for all invalid tasks.
SELECT SUBJECT_AREA AS FOLDER_NAME,
DECODE(IS_REUSABLE,1,'Reusable',' ') || ' ' ||TASK_TYPE_NAME AS TASK_TYPE,
TASK_NAME AS OBJECT_NAME,
VERSION_NUMBER, -- comment out for V6
LAST_SAVED
FROM REP_ALL_TASKS
WHERE IS_VALID=0
AND IS_ENABLED=1
--AND CHECKOUT_USER_ID = 0 -- Comment out for V6
--AND is_visible=1 -- Comment out for V6
ORDER BY SUBJECT_AREA,TASK_NAME
Load Counts
The following query lists the load counts (number of rows loaded) for the successful
sessions.
SELECT
subject_area,
workflow_name,
session_name,
DECODE (Run_Status_Code,1,'Succeeded',3,'Failed',4,'Stopped',5,'Aborted') AS
Session_Status,
successful_rows,
failed_rows,
actual_start
FROM
REP_SESS_LOG
WHERE
TRUNC(Actual_Start) BETWEEN TRUNC(SYSDATE -1) AND TRUNC(SYSDATE)
ORDER BY
subject_area,
workflow_name,
session_name,
Session_status


Using Metadata Extensions
Challenge
To provide for efficient documentation and achieve extended metadata reporting
through the use of metadata extensions in repository objects.
Description
Metadata Extensions, as the name implies, help you to extend the metadata stored in
the repository by associating information with individual objects in the repository.
Informatica Client applications can contain two types of metadata extensions: vendor-
defined and user-defined.
Vendor-defined. Third-party application vendors create vendor-defined metadata
extensions. You can view and change the values of vendor-defined metadata
extensions, but you cannot create, delete, or redefine them.
User-defined. You create user-defined metadata extensions using PowerCenter
clients. You can create, edit, delete, and view user-defined metadata extensions.
You can also change the values of user-defined extensions.
You can create reusable or non-reusable metadata extensions. You associate reusable
metadata extensions with all repository objects of a certain type. So, when you create a
reusable extension for a mapping, it is available for all mappings. Vendor-defined
metadata extensions are always reusable.
Non-reusable extensions are associated with a single repository object. Therefore, if
you edit a target and create a non-reusable extension for it, that extension is available
only for the target you edit. It is not available for other targets. You can promote a
non-reusable metadata extension to reusable, but you cannot change a reusable
metadata extension to non-reusable.
Metadata extensions can be created for the following repository objects:
Source definitions
Target definitions
Transformations (Expressions, Filters, etc.)
Mappings
Mapplets
Sessions
Tasks
Workflows
Worklets
Metadata Extensions offer a very easy and efficient method of documenting important
information associated with repository objects. For example, when you create a
mapping, you can store the mapping owner's name and contact information with the
mapping, or when you create a source definition, you can enter the name of the person
who created/imported the source.
The power of metadata extensions is most evident in the reusable type. When you
create a reusable metadata extension for any type of repository object, that metadata
extension becomes part of the properties of that type of object. For example, suppose
you create a reusable metadata extension for source definitions called SourceCreator.
When you create or edit any source definition in the Designer, the SourceCreator
extension appears on the Metadata Extensions tab. Anyone who creates or edits a
source can enter the name of the person that created the source into this field.
You can create, edit, and delete non-reusable metadata extensions for sources, targets,
transformations, mappings, and mapplets in the Designer. You can create, edit, and
delete non-reusable metadata extensions for sessions, workflows, and worklets in the
Workflow Manager. You can also promote non-reusable metadata extensions to
reusable extensions using the Designer or the Workflow Manager. You can also create
reusable metadata extensions in the Workflow Manager or Designer.
You can create, edit, and delete reusable metadata extensions for all types of
repository objects using the Repository Manager. If you want to create, edit, or
delete metadata extensions for multiple objects at one time, use the Repository
Manager. When you edit a reusable metadata extension, you can modify the properties
Default Value, Permissions and Description.
Note: You cannot create non-reusable metadata extensions in the Repository Manager.
All metadata extensions created in the Repository Manager are reusable. Reusable
metadata extensions are repository wide.
You can also migrate Metadata Extensions from one environment to another. When you
do a copy folder operation, the Copy Folder Wizard copies the metadata extension
values associated with those objects to the target repository. A non-reusable metadata
extension will be copied as a non-reusable metadata extension in the target repository.
A reusable metadata extension is copied as reusable in the target repository, and the
object retains the individual values. You can edit and delete those extensions, as well
as modify the values.
Metadata Extensions provide for extended metadata reporting capabilities. Using
the Informatica MX2 API, you can create useful reports on metadata extensions. For
example, you can create and view a report on all the mappings owned by a specific
team member. You can use various programming environments such as Visual Basic,
Visual C++, C++ and Java SDK to write API modules. The Informatica Metadata
Exchange SDK 6.0 installation CD includes sample Visual Basic and Visual C++
applications.

Additionally, Metadata Extensions can be populated from data modeling tools such as ERWin, Oracle Designer, and PowerDesigner through Informatica Metadata Exchange for
Data Models. With the Informatica Metadata Exchange for Data Models, the Informatica
Repository interface can retrieve and update the extended properties of source and
target definitions in PowerCenter repositories. Extended Properties are the descriptive,
user-defined, and other properties derived from your data modeling tool, and you can
map any of these properties to the metadata extensions that are already defined in the
source or target object in the Informatica repository.


Daily Operations
Challenge
Once the data warehouse has been moved to production, the most important task is
keeping the system running and available for the end users.
Description
In most organizations, the day-to-day operation of the data warehouse is the
responsibility of a Production Support Team. This team is typically involved with the
support of other systems and has expertise in database systems and various operating
systems. The Data Warehouse Development team becomes, in effect, a customer to the
Production Support team. To that end, the Production Support team needs two
documents, a Service Level Agreement and an Operations Manual, to help in the
support of the production data warehouse.
Service Level Agreement
The Service Level Agreement outlines how the overall data warehouse system is to be
maintained. This is a high-level document that discusses system maintenance and the
components of the system, and identifies the groups responsible for monitoring the
various components. At a minimum, it should contain the following information:
Times when the system should be available to users.
Scheduled maintenance window.
Who is expected to monitor the operating system.
Who is expected to monitor the database.
Who is expected to monitor the PowerCenter sessions.
How quickly the support team is expected to respond to notifications of system
failures.
Escalation procedures that include data warehouse team contacts in the event that
the support team cannot resolve the system failure.

Operations Manual
The Operations Manual is crucial to the Production Support team because it provides
the information needed to perform the data warehouse system maintenance. This
manual should be self-contained, providing all of the information necessary for a
production support operator to maintain the system and resolve most problems that
may arise. This manual should contain information on how to maintain all data
warehouse system components. At a minimum, the Operations Manual should contain:
Information on how to stop and re-start the various components of the system.
IDs and passwords (or how to obtain passwords) for the system components.
Information on how to re-start failed PowerCenter sessions and recovery
procedures.
A listing of all jobs that are run, their frequency (daily, weekly, monthly, etc.), and
the average run times.
Error handling strategies.
Who to call in the event of a component failure that cannot be resolved by the
Production Support team.


Data Integration Load Traceability
Challenge
Load management is one of the major difficulties facing a data integration or data
warehouse operations team. This Best Practice tries to answer the following questions:
How can the team keep track of what has been loaded?
What order should the data be loaded in?
What happens when there is a load failure?
How can bad data be removed and replaced?
How can the source of data be identified?
When was it loaded?
Description
Load management provides an architecture to allow all of the above questions to be
answered with minimal operational effort.
Benefits of a Load Management Architecture
Data Lineage
The term Data Lineage is used to describe the ability to track data from its final resting
place in the target back to its original source. This requires the tagging of every row of
data in the target with an ID from the load management metadata model. This serves
as a direct link between the actual data in the target and the original source data.
To give an example of the usefulness of this ID, a data warehouse or integration
competency center operations team, or possibly end users, can, on inspection of any
row of data in the target schema, link back to see when it was loaded, where it came
from, any other metadata about the set it was loaded with, validation check results,
number of other rows loaded at the same time, and so forth.
It is also possible to use this ID to link one row of data with all of the other rows loaded
at the same time. This can be useful when a data issue is detected in one row and the
operations team needs to see if the same error exists in all of the other rows. More
than this, it is the ability to easily identify the source data for a specific row in the
target, enabling the operations team to quickly identify where a data issue may lie.

It is often assumed that data issues are produced by the transformation processes
executed as part of the target schema load. Using the source ID to link back the source
data makes it easy to identify whether the issues were in the source data when it was
first encountered by the target schema load processes or if those load processes caused
the issue. This ability can save a huge amount of time, expense, and frustration --
particularly in the initial launch of any new subject areas.
Process Lineage
Tracking the order that data was actually processed in is often the key to resolving
processing and data issues. Because choices are often made during the processing of
data based on business rules and logic, the order and path of processing differs from
one run to the next. Only by actually tracking these processes as they act upon the
data can issue resolution be simplified.
Process Dependency Management
Having a metadata structure in place provides an environment to facilitate the
application and maintenance of business dependency rules. Once a structure is in place
that identifies every process, it becomes very simple to add the necessary metadata
and validation processes required to ensure enforcement of the dependencies among
processes. Such enforcement resolves many of the scheduling issues that operations
teams typically face.
Process dependency metadata needs to exist because it is often not possible to rely on
the source systems to deliver the correct data at the correct time. Moreover, in some
cases, transactions are split across multiple systems and must be loaded into the target
schema in a specific order. This is usually difficult to manage because the various
source systems have no way of coordinating the release of data to the target schema.
Robustness
Using load management metadata to control the loading process also offers two other
big advantages, both of which fall under the heading of robustness because they allow
for a degree of resilience to load failure.
Load Ordering
Load ordering is a set of processes that use the load management metadata to identify
the order in which the source data should be loaded. This can be as simple as making
sure the data is loaded in the sequence it arrives, or as complex as having a pre-
defined load sequence planned in the metadata.
There are a number of techniques used to manage these processes. The most common
is an automated process that generates a PowerCenter load list from flat files in a
directory, then archives the files in that list after the load is complete. This process can
use embedded data in file names or can read header records to identify the correct
ordering of the data. Alternatively the correct order can be pre-defined in the load
management metadata using load calendars.

Either way, load ordering should be employed in any data integration or data
warehousing implementation because it allows the load process to be automatically
paused when there is a load failure, and ensures that the data that has been put on
hold is loaded in the correct order as soon as possible after a failure.
The essential part of the load management process is that it operates without human
intervention, helping to make the system self-healing!
Rollback
If there is a loading failure or a data issue in normal daily load operations, it is usually
preferable to remove all of the data loaded as one set. Load management metadata
allows the operations team to selectively roll back a specific set of source data, the data
processed by a specific process, or a combination of both. This can be done using
manual intervention or through a purpose-built automated feature.
Simple Load Management Metadata Model


As you can see from the simple load management metadata model above, there are
two sets of data linked to every transaction in the target tables. These represent the
two major types of load management metadata:
Source tracking
Process tracking

Source Tracking

Source tracking looks at how the target schema validates and controls the loading of
source data. The aim is to automate as much of the load processing as possible and
track every load from the source through to the target schema.
Source Definitions
Most data integration projects use batch load operations for the majority of data
loading. The sources for these come in a variety of forms, including flat file formats
(ASCII, XML etc), relational databases, ERP systems, and legacy mainframe systems.
The first control point for the target schema is to maintain a definition of how each
source is structured, as well as other validation parameters.
These definitions should be held in a Source Master table like the one shown in the data
model above.
These definitions can and should be used to validate that the structure of the source
data has not changed. A great example of this practice is the use of DTD files in the
validation of XML feeds.
In the case of flat files, it is usual to hold details like:
Header information (if any)
How many columns
Data types for each column
Expected number of rows
For RDBMS sources, the Source Master record might hold the definition of the source
tables or store the structure of the SQL statement used to extract the data (i.e., the
SELECT, FROM and ORDER BY clauses).
These definitions can be used to manage and understand the initial validation of the
source data structures. Quite simply, if the system is validating the source against a
definition, there is an inherent control point at which problem notifications and recovery
processes can be implemented. It's better to catch a bad data structure than to start
loading bad data.
Source Instances
A Source Instance table (as shown in the load management metadata model) is
designed to hold one record for each separate set of data of a specific source type
being loaded. It should have a direct key link back to the Source Master table which
defines its type.
The various source types may need slightly different source instance metadata to
enable optimal control over each individual load.
Unlike the source definitions, this metadata will change every time a new extract and
load is performed. In the case of flat files, this would be a new file name and possibly
date / time information from its header record. In the case of relational data, it would
be the selection criteria (i.e., the SQL WHERE clause) used for each specific extract,
and the date and time it was executed.
This metadata needs to be stored in the source tracking tables so that the operations
team can identify a specific set of source data if the need arises. This need may arise if
the data needs to be removed and reloaded after an error has been spotted in the
target schema.
Process Tracking
Process tracking describes the use of load management metadata to track and control
the loading processes rather than the specific data sets themselves. There can often be
many load processes acting upon a single source instance set of data.
While it is not always necessary to be able to identify when each individual process
completes, it is very beneficial to know when a set of sessions that move data from one
stage to the next has completed. Not all sessions are tracked this way because, in most
cases, the individual processes are simply storing data into temporary tables that will
be flushed at a later date. Since load management process IDs are intended to track
back from a record in the target schema to the process used to load it, it only makes
sense to generate a new process ID if the data is being stored permanently in one of
the major staging areas.
Process Definition
Process definition metadata is held in the Process Master table (as shown in the load
management metadata model). This, in its basic form, holds a description of the
process and its overall status. It can also be extended, with the introduction of other
tables, to reflect any dependencies among processes, as well as processing holidays.
Process Instances
A process instance is represented by an individual row in the load management
metadata Process Instance table. This represents each instance of a load process that is
actually run. This holds metadata about when the process started and stopped, as well
as its current status. Most importantly, this table allocates a unique ID to each
instance.
The unique ID allocated in the process instance table is used to tag every row of source
data. This ID is then stored with each row of data in the target table.
Integrating Source and Process Tracking
Integrating source and process tracking can produce an extremely powerful
investigative and control tool for the administrators of data warehouses and integrated
schemas. This is achieved by simply linking every process ID with the source instance
ID of the source it is processing. This requires that a write-back facility be built into
every process to update its process instance record with the ID of the source instance
being processed.

The effect is that there is a one-to-many relationship between the source instance table and the process instance table, with several process instance rows for each set of source data loaded into a target schema. For example, in a data warehousing project, there might be a row for
loading the extract into a staging area, a row for the move from the staging area to an
ODS, and a final row for the move from the ODS to the warehouse.
Integrated Load Management Flow Diagram
Tracking Transactions
This is the simplest data to track since it is loaded incrementally and not updated. This
means that the process and source tracking discussed earlier in this document can be
applied as is.

Tracking Reference Data
This task is complicated by the fact that reference data, by its nature, is not static. This
means that if you simply update the data in a row any time there is a change, there is
no way that the change can be backed out using the load management practice
described earlier. Instead, Informatica recommends always using slowly changing
dimension processing on every reference data and dimension table to accomplish
source and process tracking. Updating the reference data as a slowly changing table
retains the previous versions of updated records, thus allowing any changes to be
backed out.
Tracking Aggregations
Aggregation also causes additional complexity for load management because the
resulting aggregate row very often contains the aggregation across many source data
sets. As with reference data, this means that the aggregated row cannot be backed out
in the same way as transactions.
This problem is managed by treating the source of the aggregate as if it was an original
source. This means that rather than trying to track the original source, the load
management metadata only tracks back to the transactions in the target that have
been aggregated. So, the mechanism is the same as used for transactions but the
resulting load management metadata only tracks back from the aggregate to the fact
table in the target schema.


Event Based Scheduling
Challenge
In an operational environment, the beginning of a task often needs to be triggered by
some event, either internal or external, to the Informatica environment. In versions of
PowerCenter prior to version 6.0, this was achieved through the use of indicator files.
In PowerCenter 6.0 and forward, it is achieved through use of the EventRaise and
EventWait Workflow and Worklet tasks, as well as indicator files.
Description
Event-based scheduling with versions of PowerCenter prior to 6.0 was achieved through the use of indicator files. Users specified the indicator file configuration in the session
configuration under advanced options. When the session started, the PowerCenter
Server looked for the specified file name; if it wasn't there, it waited until it appeared,
then deleted it, and triggered the session.
In PowerCenter 6.0 and above, event-based scheduling is triggered by Event-Wait and
Event-Raise tasks. These tasks can be used to define task execution order within a
workflow or worklet. They can even be used to control sessions across workflows.
An Event-Raise task represents a user-defined event.
An Event-Wait task waits for an event to occur within a workflow. After the event
triggers, the PowerCenter Server continues executing the workflow from the
Event-Wait task forward.
The following paragraphs describe events that can be triggered by an Event-Wait task.
Waiting for Pre-Defined Events
To use a pre-defined event, you need a session, shell command, script, or batch file to
create an indicator file. You must create the file locally or send it to a directory local to
the PowerCenter Server. The file can be any format recognized by the PowerCenter
Server operating system. You can choose to have the PowerCenter Server delete the
indicator file after it detects the file, or you can manually delete the indicator file. The
PowerCenter Server marks the status of the Event-Wait task as "failed" if it cannot
delete the indicator file.
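For example, a post-extract step on the source system could create the indicator file with a simple shell command (the path and file name below are examples only):

# Signal that the nightly extract is complete and ready to load.
touch /data/indicators/orders_extract_complete.ind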

When you specify the indicator file in the Event-Wait task, specify the directory in which
the file will appear and the name of the indicator file. Do not use either a source or
target file name as the indicator file name. You must also provide the absolute path for
the file and the directory must be local to the PowerCenter Server. If you only specify
the file name, and not the directory, Workflow Manager looks for the indicator file in the
system directory. For example, on Windows NT, the system directory is
C:/winnt/system32. You can enter the actual name of the file or use server variables to
specify the location of the files. The PowerCenter Server writes the time the file appears
in the workflow log.
Follow these steps to set up a pre-defined event in the workflow:
1. Create an Event-Wait task and double-click the Event-Wait task to open the Edit
Tasks dialog box.
2. In the Events tab of the Edit Task dialog box, select Pre-defined.
3. Enter the path of the indicator file.
4. If you want the PowerCenter Server to delete the indicator file after it detects
the file, select the Delete Indicator File option in the Properties tab.
5. Click OK.
Pre-defined Event
A pre-defined event is a file-watch event. For pre-defined events, use an Event-Wait
task to instruct the PowerCenter Server to wait for the specified indicator file to appear
before continuing with the rest of the workflow. When the PowerCenter Server locates
the indicator file, it starts the task downstream of the Event-Wait.
User-defined Event
A user-defined event is defined at the workflow or worklet level and the Event-Raise
task triggers the event at one point of the workflow/worklet. If an Event-Wait task is
configured in the same workflow/worklet to listen for that event, then execution will
continue from the Event-Wait task forward.
The following is an example of using user-defined events:
Assume that you have four sessions that you want to execute in a workflow. You want
P1_session and P2_session to execute concurrently to save time. You also want to
execute Q3_session after P1_session completes. You want to execute Q4_session only
when P1_session, P2_session, and Q3_session complete. Follow these steps:
1. Link P1_session and P2_session concurrently.
2. Add Q3_session after P1_session.
3. Declare an event called P1Q3_Complete in the Events tab of the workflow
properties.
4. In the workspace, add an Event-Raise task after Q3_session.
5. Specify the P1Q3_Complete event in the Event-Raise task properties. This allows
the Event-Raise task to trigger the event when P1_session and Q3_session
complete.
6. Add an Event-Wait task after P2_session.
7. Specify the P1Q3_Complete event for the Event-Wait task.
8. Add Q4_session after the Event-Wait task. When the PowerCenter Server
processes the Event-Wait task, it waits until the Event-Raise task triggers
P1Q3_Complete before it executes Q4_session.
The PowerCenter Server executes the workflow in the following order:
1. The PowerCenter Server executes P1_session and P2_session concurrently.
2. When P1_session completes, the PowerCenter Server executes Q3_session.
3. The PowerCenter Server finishes executing P2_session.
4. The Event-Wait task waits for the Event-Raise task to trigger the event.
5. The PowerCenter Server completes Q3_session.
6. The Event-Raise task triggers the event, P1Q3_Complete.
7. The PowerCenter Server executes Q4_session because the event,
P1Q3_Complete, has been triggered.
Be sure to take care in setting the links, though. If they are left as the default and Q3_session fails, the Event-Raise will never happen. Then the Event-Wait will wait forever and the workflow will run until it is stopped. To avoid this, check the workflow option suspend on error. With this option, if a session fails, the whole workflow goes into suspended mode and can send an email to notify developers.


High Availability
Challenge
Availability of the environment that processes data is key to all organizations. When
processing systems are unavailable, companies are not able to meet their service level
agreements and service their internal and external customers.
High availability within the PowerCenter architecture is related to making sure the
necessary processing resources are available to meet these business needs.
In a highly available environment such as PowerCenter, load schedules cannot be
allowed to be impacted by the failure of physical hardware. The PowerCenter server
must be running at all times. If the machine hosting the PowerCenter server goes
down, another machine must recognize this and start another server and transfer
responsibility for running the sessions and batches.
Processes also need to be designed for restartability and to handle switching between
servers, making all processes server independent.
These architecture and process considerations support High Availability in a
PowerCenter environment.
Description
In PowerCenter terms, high availability is best accomplished in a clustered environment.
Example
While there are many types of hardware and many ways to configure a clustered
environment, this example is based on the following hardware and software
characteristics:
Two Sun 4500s, running Solaris OS
Sun High-Availability Clustering Software
External EMC storage, with each server owning specific disks
PowerCenter installed on a separate disk that is accessible by both servers in the
cluster, but only by one server at a time

One of the Sun 4500s serves as the primary data integration server, while the other
server in the cluster is the secondary server. Under normal operations, the
PowerCenter server thinks it is physically hosted by the primary server and uses the
resources of the primary server, although its software is installed on its own shared disk.
When the primary server goes down, the Sun high-availability software automatically
starts the PowerCenter server on the secondary server using the basic auto start/stop
scripts that are used in many UNIX environments to automatically start the
PowerCenter server whenever a host is rebooted. In addition, the Sun high-availability
software changes the ownership of the disk where the PowerCenter server is installed
from the primary server to the secondary server. To facilitate this, a logical IP address
can be created specifically for the PowerCenter server. This logical IP address is
specified in the pmserver.cfg file instead of the physical IP addresses of the servers.
Thus, only one pmserver.cfg file is needed.
Note: The pmserver.cfg file is located with the pmserver code, typically at:
{informatica_home}/{version label}/pmserver.
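The auto start/stop scripts referenced above are typically thin wrappers around the pmserver binary. A minimal sketch, assuming an example shared installation path, is:

#!/bin/sh
# Minimal PowerCenter server start script for the cluster's auto start/stop framework.
# The installation path is an example only; pmserver reads pmserver.cfg from this directory.
PMSERVER_HOME=/shared/informatica/pm7/pmserver
cd $PMSERVER_HOME
./pmserver &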
Process
A high-availability environment can handle a variety of situations such as hardware
failures or core dumps. However, such an environment can also generate problems
when processes fail via an Abort mid-stream which uses signal files, surrogate keys or
other intermediate results.
When an abort occurs on the non-Informatica side, any intermediate files created by
UNIX scripts need to be taken into account in the restart procedures. However, if an
abort or system failure occurs on the Informatica side, any write-back to the repository
will not be executed. For example, if a sequence generator is being used for a
surrogate key, the final surrogate key value will not be written to the repository. This
problem needs to be addressed as part of the restart logic by caching sequence
generator values or designing code that can handle this situation.
An example of the consequences of not addressing this problem could include incorrect
handling of surrogate keys. A surrogate key is a key that does not have business
meaning; it is generated as part of a process. Informatica sequence generators are
frequently used to hold the next key value to use for a new key. If a hardware failure
occurs, the current value of the sequence generator will not be written to the
repository. Therefore, without handling this situation, the next time a new row is
written it would use an old key value and update an incorrect row of data. This would
be a catastrophic data problem and must be prevented.
It is recommended to design processes that can restart in the event of any failure, including this example, without any manual cleanup required. For the above surrogate key problem, there are two solutions:
Every time you get a sequence value, cache the number of values that will be
needed before the next commit of the database. While this will prevent the
catastrophic data problem, it also could waste a large number of key values that
were never used.
An alternative approach would be to look up the maximum key value each time this
process runs, then use the sequence generator reset feature and always start
at 1, incrementing the value for each new row of data. This would allow simple
and risk-free restarts and not waste any key values. This is the recommended
approach.
The previous example is just one of many potential restart problems. Developers need
to design carefully and extend these principles to other objects such as variable values,
run details, and any other details written to the repository at the completion of a
session or task. These problems are most significant when the repository is used to
hold process data or when temporary results are stored on the server rather than
having processes handle these situations.
Designing a high-availability system
In developing a high-availability system or developing processes in a high-availability environment, it is advisable to address the following process issues:
Issue: Are signal files used?
Solution: As part of the restart process, check for the existence of signal files and clean up files as appropriate on all servers.

Issue: Are sequence generators used?
Solution: If sequence generators are used, write audit or operational processes to evaluate if a sequence generator is out of sync and update as appropriate.

Issue: Are there nested processes within a workflow?
Solution: Are the workflows written in such a way that they can either be restarted at the beginning of the workflow with no ill effects, or that the individual sessions can be restarted without causing error handling to fail because other sessions were not run during the current execution?

Issue: Are there batch controls that utilize components from previous issues?
Solution: Validate that batch controls can handle a mid-stream restart.
These situations should be resolved before running high availability in production. If
the high availability environment is already in production, restart procedures should be
modified to handle these situations.
When an environment has high availability in place, all development should be designed
for restartability and should address the considerations listed in the previous examples.
Summary
High Availability in PowerCenter is composed of two sets of tasks, architectural and
procedural. It is critical that both are considered when creating a High Availability
solution. Companies must both implement a clustered environment to handle hardware
failures and develop processes which can be easily restarted regardless of the type of
failure or the server they are executing on.



Load Validation
Challenge
Knowing that all data for the current load cycle has loaded correctly is essential for
good data warehouse management. However, the need for load validation varies,
depending on the extent of error checking, data validation, and/or data cleansing
functionalities inherent in your mappings. For large data integration projects, with
thousands of mappings, the task of reporting load statuses becomes overwhelming
without a well-planned load validation process.
Description
Methods for validating the load process range from simple to complex. Use the
following steps to plan a load validation process:
1. Determine what information you need for load validation (e.g., workflow names,
session names, session start times, session completion times, successful rows
and failed rows).
2. Determine the source of this information. All this information is stored as
metadata in the PowerCenter repository, but you must have a means of
extracting this information.
3. Determine how you want this information presented to you. Should the
information be delivered in a report? Do you want it emailed to you? Do you
want it available in a relational table, so that history is easily preserved? Do you
want it stored as a flat file?
All of these factors come into play in finding the correct solution for you.
The following paragraphs describe five possible solutions for load validation, beginning
with a fairly simple solution and moving toward the more complex:
1. Post-session Emails on Either Success or Failure
One practical application of the post-session email functionality is the situation in which
a key business user waits for completion of a session to run a report. You can configure
email to this user, notifying him or her that the session was successful and the report
can run. Another practical application is the situation in which a production support
analyst needs to be notified immediately of any failures. You can configure the session
to send an email to the analyst for a failure. For around the clock support, a pager
number can be used in place of an email address.
Post-session e-mail is configured in the session, under the General tab and
Session Commands.
A number of variables are available to simplify the text of the e-mail:
%s Session name
%e Session status
%b Session start time
%c Session completion time
%i Session elapsed time
%l Total records loaded
%r Total records rejected
%t Target table details
%m Name of the mapping used in the session
%n Name of the folder containing the session
%d Name of the repository containing the session
%g Attach the session log to the message
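For example, a failure notification for production support might combine several of these variables in the message text (illustrative wording only):
Session %s completed with status %e.
Start time: %b   Completion time: %c   Elapsed: %i
Rows loaded: %l   Rows rejected: %r
%g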

2. Other Workflow Manager Features
Besides post-session emails, there are other features available in the Workflow Manager
to help validate loads. Control, Decision, Event, and Timer tasks are some of the
features you can use to place multiple controls on the behavior of your loads. Another
feature is to place conditions in your links. Links are used to connect tasks within a
workflow or worklet. You can use the pre-defined or user-defined variables in the link
conditions. In the example below, upon the successful completion of both sessions A
and B, the PowerCenter Server will execute session C.
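For instance, the link from session A into session C might carry a condition such as the following, using the predefined Status variable (the session name shown is a placeholder; confirm the exact syntax in the Workflow Manager expression editor for your version):
$s_session_A.Status = SUCCEEDED
With an equivalent condition on the link from session B, and the default AND treatment of input links, session C runs only when both predecessor sessions succeed.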

3. PowerCenter Metadata Reporter (PCMR) Reports
The PowerCenter Metadata Reporter (PCMR) is a web-based business intelligence (BI)
tool that is included with every Informatica PowerCenter license to give visibility into
metadata stored in the PowerCenter repository in a manner that is easy to comprehend
and distribute. The PCMR includes more than 130 pre-packaged metadata reports and
dashboards delivered through PowerAnalyzer, Informatica's BI offering. These pre-
packaged reports enable PowerCenter customers to extract extensive business and
technical metadata through easy-to-read reports including:
Load statistics and operational metadata that enable load validation.
Table dependencies and impact analysis that enable change management.
PowerCenter object statistics to aid in development assistance.
Historical load statistics that enable planning for growth.


In addition to the 130 pre-packaged reports and dashboards that come standard with
PCMR, you can develop additional custom reports and dashboards based on the PCMR
limited use license that allows you to source reports from the PowerCenter repository.
Examples of custom components that can be created include:
Repository-wide reports and/or dashboards with indicators of daily load
success/failure.
Customized project-based dashboard with visual indicators of daily load
success/failure.
Detailed daily load statistics report for each project that can be exported to
Microsoft Excel or PDF.
Error handling reports that deliver error messages and source data for row level
errors that may have occurred during a load.

Below is an example of a custom dashboard that gives instant insight into the load
validation across an entire repository through four custom indicators.

4. Query Informatica Metadata Exchange (MX) Views
Informatica Metadata Exchange (MX) provides a set of relational views that allow easy
SQL access to the PowerCenter repository. The Repository Manager generates these
views when you create or upgrade a repository. Almost any query can be put together
to retrieve metadata related to the load execution from the repository. The MX view,
REP_SESS_LOG, is a great place to start. This view is likely to contain all the
information you need. The following sample query shows how to extract folder name,
session name, session end time, successful rows, and session duration:
select subject_area, session_name, session_timestamp, successful_rows,
       (session_timestamp - actual_start) * 24 * 60 * 60
from rep_sess_log a
where session_timestamp = (select max(session_timestamp)
                           from rep_sess_log
                           where session_name = a.session_name)
order by subject_area, session_name
For the most recent run of each session, the output lists the folder name, session name,
completion timestamp, rows loaded, and run duration in seconds.

TIP
Informatica strongly advises against querying directly from the repository
tables. Since future versions of PowerCenter will most likely alter the
underlying repository tables, PowerCenter will support queries from the
unaltered MX views, not the repository tables.
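Staying with the MX views, a small variation on the earlier query can flag sessions that rejected rows during the current load cycle. This is a sketch only; it assumes REP_SESS_LOG also exposes a FAILED_ROWS column and uses Oracle date syntax, so verify the column names and adjust the date logic for your repository database:
select subject_area, session_name, successful_rows, failed_rows
from rep_sess_log
where failed_rows > 0
and session_timestamp > trunc(sysdate)
order by subject_area, session_name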

5. Mapping Approach
A more complex approach, and the most customizable, is to create a PowerCenter
mapping to populate a table or flat file with desired information. You can do this by
sourcing the MX view REP_SESS_LOG and then performing lookups to other repository
tables or views for additional information.
The following graphic illustrates a sample mapping:

This mapping selects data from REP_SESS_LOG and performs lookups to retrieve the
absolute minimum and maximum run times for that particular session. This enables you
to compare the current execution time with the minimum and maximum durations.
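The lookups in such a mapping might be based on a query along the following lines, which derives the historical minimum and maximum durations per session from the same REP_SESS_LOG columns used in the earlier example (a sketch only; adapt it to wherever your run history is stored):
select session_name,
       min((session_timestamp - actual_start) * 24 * 60 * 60) as min_duration_secs,
       max((session_timestamp - actual_start) * 24 * 60 * 60) as max_duration_secs
from rep_sess_log
group by session_name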
Please note that unless you have acquired additional licensing, a customized metadata
data mart cannot be a source for a PCMR report. However, you can use a business
intelligence tool of your choice instead.



Repository Administration
Challenge
Defining the role of the PowerCenter repository administrator and describing the tasks
required to properly manage the repository.
Description
The PowerCenter repository administrator has many responsibilities. In addition to
regularly backing up the repository, truncating logs, and updating the database
statistics, he or she also typically performs the following functions:
Determine metadata strategy
Install/configure client/server software
Migrate development to test and production
Maintain PowerCenter Servers
Upgrade software
Administer security and folder organization
Monitor and tune environment
NOTE: The Repository Administrator is also typically responsible for maintaining
repository passwords; changing them on a regular basis and keeping a record of them
in a secure place.
Determine Metadata Strategy
The Repository Administrator is responsible for developing the structure and standard
for metadata in the PowerCenter Repository. This includes developing naming
conventions for all objects in the repository, creating a folder organization, and
maintaining the repository. The Administrator is also responsible for modifying the
metadata strategies to suit changing business needs or to fit the needs of a particular
project. Such changes may include new folder names and/or different security setup.
Install/Configure Client/Server Software
This responsibility includes installing and configuring the application servers in all
applicable environments (e.g., development, QA, production, etc.). The Administrator
must have a thorough understanding of the working environment, along with access to
resources such as an NT or UNIX administrator and a DBA.

The Administrator is also responsible for installing and configuring the client tools.
Although end users can generally install the client software, the configuration of the
client tool connections benefits from being consistent throughout the repository
environment. The Administrator, therefore, needs to enforce this consistency in order to
maintain an organized repository.
Migrate Development to Production
When the time comes for content in the development environment to be moved to test
and production environments, it is the responsibility of the Administrator to schedule,
track, and copy folder changes. Also, it is crucial to keep track of the changes that
have taken place. It is the role of the Administrator to track these changes through a
change control process. The Administrator should be the only individual able to
physically move folders from one environment to another.
If a versioned repository is used, the Administrator should set up labels and instruct the
developers on the labels that they must apply to their repository objects (i.e., reusable
transformations, mappings, workflows and sessions). This task also requires close
communication with project staff to review the status of items of work to ensure, for
example, that only tested or approved work is migrated.
Maintain PowerCenter Servers
The Administrator must also be able to understand and troubleshoot the server
environment. He or she should have a good understanding of how the server operates
under various situations and be fully aware of all connections to the server. The
Administrator should also understand what the server does when a session is running
and be able to identify those processes. Additionally, certain mappings may produce
files in addition to the standard session and workflow logs. The Administrator should be
familiar with these files and know how and where to maintain them.
Upgrade Software
If and when the time comes to upgrade software, the Administrator is responsible for
overseeing the installation and upgrade process.
Security and Folder Administration
Security administration consists of creating, maintaining, and updating all users within
the repository, including creating and assigning groups based on new and changing
projects and defining which folders are to be shared, and at what level. Folder
administration involves creating and maintaining the security of all folders. The
Administrator should be the only user with privileges to edit folder properties.
Tune Environment
The Administrator should have sole responsibility for implementing performance
changes to the server environment. He or she should observe server performance
throughout development so as to identify any bottlenecks in the system. In the
production environment, the Repository Administrator should monitor the jobs and any
growth (e.g., increases in data volume or throughput time) and communicate such
changes to other staff as appropriate to address bottlenecks, accommodate growth, and
ensure that the required data is loaded within the prescribed load window.
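A simple query against the REP_SESS_LOG MX view (described in the Load Validation best practice earlier in this document) can support this kind of growth monitoring. The sketch below reports average rows loaded and average duration per session by month; TO_CHAR is Oracle syntax, so substitute the equivalent date formatting for your repository database:
select subject_area, session_name,
       to_char(session_timestamp, 'YYYY-MM') as run_month,
       avg(successful_rows) as avg_rows_loaded,
       avg((session_timestamp - actual_start) * 24 * 60 * 60) as avg_duration_secs
from rep_sess_log
group by subject_area, session_name, to_char(session_timestamp, 'YYYY-MM')
order by subject_area, session_name, run_month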




SuperGlue Repository Administration
Challenge
The task of administering SuperGlue Repository involves taking care of both the
integration repository and the SuperGlue warehouse. This requires a knowledge of both
PowerCenter administrative features (i.e., the integration repository used in SuperGlue)
and SuperGlue administration features.
Description
A SuperGlue administrator needs to be involved in the following areas to ensure that
the SuperGlue metadata warehouse is fulfilling the end-user needs:
Migration of SuperGlue objects created in the Development environment to QA or
the Production environment
Creation and maintenance of access and privileges of SuperGlue objects
Repository backups
Job monitoring
Metamodel creation.
Migration from Development to QA or Production
In cases where a client has modified out-of-the-box objects provided in SuperGlue or
created a custom metamodel for custom metadata, the objects must be tested in the
Development environment prior to being migrated to the QA or Production
environments. The SuperGlue Administrator needs to do the following to ensure that
the objects are in sync between the two environments:
Install a new SuperGlue instance for the QA/Production environment. This involves
creating a new integration repository and SuperGlue warehouse
Export the metamodel from the Development environment and import it to QA or
production via XML Import/Export functionality (in the SuperGlue Administration
tab) or via the SGCmd command line utility

Export the custom or modified reports created or configured in the Development
environment and import them to QA or Production via XML Import/Export
functionality in SG Administration Tab. This functionality is identical to the
function in PowerAnalyzer; refer to the PowerAnalyzer Administration Guide
for details on the import/export function.
Providing Access and Privileges
Users can perform a variety of SuperGlue tasks based on their privileges. The
SuperGlue Administrator can assign privileges to users by assigning them roles. Each
role has a set of privileges that allow the associated users to perform specific tasks. The
Administrator can also create groups of users so that all users in a particular group
have the same functions. When an Administrator assigns a role to a group, all users of
that group receive the privileges assigned to the role. For more information about
privileges, users, and groups, see the PowerAnalyzer Administrator Guide.
The SuperGlue Administrator can assign privileges to users to enable them to perform
any of the following tasks in SuperGlue:
Configure reports. Users can view particular reports, create reports, and/or
modify the reporting schema.
Configure the SuperGlue Warehouse. Users can add, edit, and delete
repository objects using SuperGlue.
Configure metamodels. Users can add, edit, and delete metamodels.
SuperGlue also allows the Administrator to create access permissions on specific source
repository objects for specific users. Users can be restricted to reading, writing, or
deleting source repository objects that appear in SuperGlue.
Similarly, the Administrator can establish access permissions for source repository
objects in the SuperGlue warehouse. Access permissions determine the tasks that users
can perform on specific objects. When the Administrator sets access permissions, he or
she determines which users have access to the source repository objects that appear in
SuperGlue. The Administrator can assign the following types of access permissions to
objects:
Read - Grants permission to view the details of an object and the names of any
objects it contains.

Write - Grants permission to edit an object and create new repository objects in
the SuperGlue warehouse.
Delete - Grants permission to delete an object from a repository.
Change permission - Grants permission to change the access permissions for an
object.
When a repository is first loaded into the SuperGlue warehouse, SuperGlue provides all
permissions to users with the System Administrator role. All other users receive read
permissions. The Administrator can then set inclusive and exclusive access permissions as appropriate.


Metamodel Creation

In cases where a client needs to create custom metamodels for sourcing custom
metadata, the SuperGlue Administrator needs to create new packages, originators,
repository types and class associations. For details on how to create new metamodels
for custom metadata loading and rendering in SuperGlue, refer to the SuperGlue
Installation and Administration Guide.
Job Monitoring
When SuperGlue XConnects are running in the Production environment, Informatica
recommends monitoring loads through the SuperGlue console. The Configuration
Console Activity Log in the SuperGlue console can identify the total time it takes for
an XConnect to complete. The console maintains a history of all runs of an XConnect,
enabling a SuperGlue Administrator to ensure that load times are meeting the SLA
agreed upon with end users and that load times are not increasing inordinately as
data increases in the SuperGlue warehouse.
The Activity Log provides the following details about each repository load:
Repository Name - name of the source repository defined in SuperGlue
Run Start Date - day of week and date the XConnect run began
Start Time - time the XConnect run started
End Time - time the XConnect run completed
Duration - number of seconds the XConnect run took to complete
Ran From - machine hosting the source repository
Last Refresh Status - status of the XConnect run, and whether it completed
successfully or failed

Repository Backups

When SuperGlue is running in either the Production or QA environment, Informatica
recommends taking periodic backups of the following areas:
Database backups of the SuperGlue warehouse
Integration repository; Informatica recommends either of two methods for this
backup:
o The PowerCenter Repository Server Administration Console or pmrep
command line utility
o The traditional, native database backup method.
The native PowerCenter backup is required but Informatica recommends using both
methods because, if database corruption occurs, the native PowerCenter backup
provides a clean backup that can be restored to a new database.



Third Party Scheduler
Challenge
Successfully integrate a third-party scheduler with PowerCenter. This Best Practice
describes various levels of integration between a third-party scheduler and PowerCenter.
Description
Tasks such as getting server and session properties, session status, or starting or
stopping a workflow or a task can be performed either through the Workflow Monitor or
by integrating a third-party scheduler with PowerCenter. A third-party scheduler can be
integrated with PowerCenter at any of several levels. The level of integration depends
on the complexity of the workflow/schedule and the skill sets of production support
personnel.
Many companies want to automate the scheduling process by using scripts or third-
party schedulers. In some cases, they are using a standard scheduler and want to
continue using it to drive the scheduling process.
A third-party scheduler can start or stop a workflow or task, obtain session statistics,
and get server details using the pmcmd commands. pmcmd is a program used to
communicate with the PowerCenter server. PowerCenter 7 greatly enhances pmcmd
functionality, providing commands to support the concept of workflows and workflow
monitoring while retaining compatibility with old syntax.
Third Party Scheduler Integration Levels
In general, there are three levels of integration between a third-party scheduler and
PowerCenter: Low, Medium, and High.
Low Level
Low-level integration refers to a third-party scheduler kicking off the initial PowerCenter
workflow. This process subsequently kicks off the rest of the tasks or sessions. The
PowerCenter scheduler handles all processes and dependencies after the third-party
scheduler has kicked off the initial workflow. In this level of integration, nearly all
control lies with the PowerCenter scheduler.

This type of integration is very simple to implement because the third-party scheduler
kicks off only one process; it is often used simply to satisfy a corporate mandate for a
standard scheduler. This type of integration also takes advantage of the robust
functionality offered by the Workflow Monitor.
Low-level integration requires production support personnel to have a thorough
knowledge of PowerCenter. Because Production Support personnel in many companies
are only knowledgeable about the company's standard scheduler, one of the main
disadvantages of this level of integration is that if a batch fails at some point, the
Production Support personnel may not be able to determine the exact breakpoint. Thus,
the majority of the production support burden falls back on the Project Development
team.
Medium Level
With Medium-level integration, a third-party scheduler kicks off some, but not all,
workflows or tasks. Within the tasks, many sessions may be defined with dependencies.
PowerCenter controls the dependencies within the tasks.
With this level of integration, control is shared between PowerCenter and the third-
party scheduler, which requires more integration between the third-party scheduler and
PowerCenter. Medium-level integration requires Production Support personnel to have a
fairly good knowledge of PowerCenter and also of the scheduling tool. If they do not
have in-depth knowledge about the tool, they may be unable to fix problems that arise,
so the production support burden is shared between the Project Development team and
the Production Support team.
High Level
With High-level integration, the third-party scheduler has full control of scheduling and
kicks off all PowerCenter sessions. In this case, the third-party scheduler is responsible
for controlling all dependencies among the sessions. This type of integration is the most
complex to implement because there are many more interactions between the third-
party scheduler and PowerCenter.
Production Support personnel may have limited knowledge of PowerCenter but must
have thorough knowledge of the scheduling tool. Because Production Support personnel
in many companies are knowledgeable only about the company's standard scheduler,
one of the main advantages of this level of integration is that if the batch fails at some
point, the Production Support personnel are usually able to determine the exact
breakpoint. Thus, the production support burden lies with the Production Support team.
Sample Scheduler Script
There are many independent scheduling tools on the market. The following is an
example of an AutoSys script that can be used to start tasks; it is included here simply
as an illustration of how a scheduler can be implemented in the PowerCenter
environment. The script captures the return codes and aborts on error, returning a
success or failure (with associated return codes) to the command line or the AutoSys
GUI monitor.

# Name: jobname.job
# Author: Author Name
# Date: 01/03/2005
# Description:
# Schedule: Daily
#
# Modification History
# When Who Why
#
#------------------------------------------------------------------
. jobstart $0 $*
# Set variables. A temporary file is created to store error information;
# the file format is TDDHHMISS<PROCESS-ID>.lst
ERR_DIR=/tmp
CurDayTime=`date +%d%H%M%S`
FName=T$CurDayTime$$.lst
if [ $STEP -le 1 ]
then
echo "Step 1: RUNNING wf_stg_tmp_product_xref_table..."
cd /dbvol03/vendor/informatica/pmserver/
# Edit the following lines to reference the workflow or task that you are attempting to start.
# To start the entire workflow instead of a single task, use:
# pmcmd startworkflow -s ah-hp9:4001 -u Administrator -p informat01 wf_stg_tmp_product_xref_table
pmcmd starttask -s ah-hp9:4001 -u Administrator -p informat01 -f FINDW_SRC_STG -w WF_STG_TMP_PRODUCT_XREF_TABLE -wait s_M_STG_TMP_PRODUCT_XREF_TABLE
# Check the return code and abort the current process on error
RetVal=$?
echo "Status = $RetVal"
if [ $RetVal -ge 1 ]
then
jobend abnormal "Step 1: Failed wf_stg_tmp_product_xref_table...\n"
exit 1
fi
echo "Step 1: Successful"
fi
jobend normal
exit 0


Updating Repository Statistics
Challenge
The PowerCenter repository has more than 170 tables, and most have one or more
indexes to speed up queries. Most databases use column distribution statistics to
determine which index to use to optimize performance. It can be important, especially
in large or high-use repositories, to update these statistics regularly to avoid
performance degradation.
Description
For PowerCenter 7 and later, statistics are updated during copy, backup or restore
operations. In addition, the pmrep command has an option to update statistics that
can be scheduled as part of a regularly-run script.
For PowerCenter 6 and earlier there are specific strategies for Oracle, Sybase, SQL
Server, DB2 and Informix discussed below. Each example shows how to extract the
information out of the PowerCenter repository and incorporate it into a custom stored
procedure.
Features in PowerCenter version 7 and later
Copy, Backup and Restore Repositories
PowerCenter 7 automatically identifies and updates all statistics of all repository tables
and indexes when a repository is copied, backed up, or restored. If you follow a
strategy of regular repository backups, the statistics will also be updated.
PMREP Command
PowerCenter 7 also has a command line option to update statistics in the database.
This allows the command to be put in a Windows batch file or UNIX shell script.
The format of the command is: pmrep updatestatistics {-s filelistfile}
The -s option allows you to skip tables for which you do not want to update statistics.
Example of Automating the Process

One approach to automating this is to create a UNIX shell script that calls the pmrep
updatestatistics command, incorporate it into a special workflow in PowerCenter as a
command task, and run it on a scheduled basis. Note: the Workflow Manager supports
command tasks as well as scheduling.
In addition, this workflow can be scheduled to run on a daily, weekly, or monthly basis.
This keeps the statistics updated regularly so that performance does not degrade.
Tuning Strategies for PowerCenter version 6 and earlier
The following are strategies for generating scripts to update distribution statistics. Note
that all PowerCenter repository tables and index names begin with "OPB_" or "REP_".
Oracle
Run the following queries:
select 'analyze table ', table_name, ' compute statistics;'
from user_tables where table_name like 'OPB_%'

select 'analyze index ', INDEX_NAME, ' compute statistics;'
from user_indexes where INDEX_NAME like 'OPB_%'
This will produce output like:
'ANALYZETABLE' TABLE_NAME 'COMPUTESTATISTICS;'
analyze table OPB_ANALYZE_DEP compute statistics;
analyze table OPB_ATTR compute statistics;
analyze table OPB_BATCH_OBJECT compute statistics;
.
.
.
'ANALYZEINDEX' INDEX_NAME 'COMPUTESTATISTICS;'
analyze index OPB_DBD_IDX compute statistics;
analyze index OPB_DIM_LEVEL compute statistics;
analyze index OPB_EXPR_IDX compute statistics;
.
.
Save the output to a file. Then edit the file and remove all the header lines (i.e., the
lines that look like 'ANALYZEINDEX' INDEX_NAME 'COMPUTESTATISTICS;').
Run this as a SQL script. This updates statistics for the repository tables.

MS SQL Server
Run the following query:
select 'update statistics ', name from sysobjects where name like 'OPB_%'
This will produce output like:
name

update statistics OPB_ANALYZE_DEP
update statistics OPB_ATTR
update statistics OPB_BATCH_OBJECT
.
.
Save the output to a file, then edit the file and remove the header information (i.e., the
top two lines) and add a 'go' at the end of the file.
Run this as a SQL script. This updates statistics for the repository tables.

Sybase
Run the following query:
select 'update statistics ', name from sysobjects where name like 'OPB_%'
This will produce output like:
name
update statistics OPB_ANALYZE_DEP
update statistics OPB_ATTR
update statistics OPB_BATCH_OBJECT
.
.
.
Save the output to a file, then remove the header information (i.e., the top two lines),
and add a 'go' at the end of the file.
Run this as a SQL script. This updates statistics for the repository tables.

Informix
Run the following query:

select 'update statistics low for table ', tabname, ' ;' from systables where tabname like
'opb_%' or tabname like 'OPB_%';
This will produce output like:
(constant) tabname (constant)
update statistics low for table OPB_ANALYZE_DEP ;
update statistics low for table OPB_ATTR ;
update statistics low for table OPB_BATCH_OBJECT ;
.
.
.
Save the output to a file, then edit the file and remove the header information (i.e., the
top line that looks like '(constant) tabname (constant)').
Run this as a SQL script. This updates statistics for the repository tables.

DB2
Run the following query:
select 'runstats on table ', (rtrim(tabschema)||'.')||tabname, ' and indexes all;'
from sysstat.tables where tabname like 'OPB_%'
This will produce output like:
runstats on table PARTH.OPB_ANALYZE_DEP
and indexes all;
runstats on table PARTH.OPB_ATTR
and indexes all;
runstats on table PARTH.OPB_BATCH_OBJECT
and indexes all;

.
.
.
Save the output to a file.
Run this as a SQL script to update statistics for the repository tables.



Deploying PowerAnalyzer Objects
Challenge
To understand the methods for deploying PowerAnalyzer objects between repositories,
and the limitations of those methods.
Description
PowerAnalyzer repository objects can be exported to and imported from Extensible
Markup Language (XML) files. Export/import facilitates archiving the PowerAnalyzer
repository and deploying PowerAnalyzer dashboards and reports from development to
production. The following repository objects can be exported and imported:
Schemas
Reports
Time Dimensions
Global Variables
Dashboards
Security profiles
Schedules
Users
Groups
Roles
It is advisable not to modify the XML file created by exporting objects. Any change
might invalidate the XML file and cause the import into a PowerAnalyzer repository to
fail.
For more information on exporting objects from the PowerAnalyzer repository, refer to
Chapter 13 in the PowerAnalyzer Administration Guide.
EXPORTING SCHEMA(S):
To export the definition of a star schema or an operational schema, you need to select
a metric or folder from the Metrics system folder in the Schema Directory. When you
export a folder, you export the schema associated with the definitions of the metrics in
that folder and its subfolders. If the folder you select for export does not contain any
objects, PowerAnalyzer does not export any schema definition and displays the
following message:
There is no content to be exported.
There are two ways to export metrics or folders containing metrics. First, you can
select the Export Metric Definitions and All Associated Schema Table and Attribute
Definitions option. If you select to export a metric and its associated schema objects,
PowerAnalyzer exports the definitions of the metric and the schema objects associated
with that metric. If you select to export an entire metric folder and its associated
objects, PowerAnalyzer exports the definitions of all metrics in the folder, as wel l as
schema objects associated with every metric in the folder.
The other way to export metrics or folders containing metrics is to select the Export
Metric Definitions Only option. When you choose to export only the definition of the
selected metric, PowerAnalyzer does not export the definition of the schema table from
which the metric is derived or any other associated schema object.
Steps:
1. Login to PowerAnalyzer as a System Administrator
2. Click on the Administration tab > XML Export/Import > Export Schemas
3. All the metric folders in the schema directory are displayed. Click Refresh
Schema option to display the latest list of folders and metrics in the schema
directory
4. Select the check box for the folder or metric to be exported and click Export as
XML option
5. Enter XML filename and click Save option to save the XML file
6. The XML file will be stored locally on the client machine

EXPORTING REPORT(S):
To export the definitions of more than one report, select multiple reports or folders.
PowerAnalyzer exports only report definitions. It does not export the data or the
schedule for cached reports. As part of the report definition export, PowerAnalyzer
exports the report table, report chart, filters, indicators (gauge, chart, and table
indicators), custom metrics, links to similar reports, and all reports in an analytic
workflow, including links to similar reports.
Reports might have public or personal indicators associated with them. By default,
PowerAnalyzer exports only public indicators associated with a report. To export the
personal indicators as well, select the Export Personal Indicators check box.
To export an analytic workflow, you need to export only the originating report. When
you export the originating report of an analytic workflow, PowerAnalyzer exports the
definitions of all the workflow reports. If a report in the analytic workflow has similar
reports associated with it, PowerAnalyzer exports the links to the similar reports.

PowerAnalyzer does not export alerts, schedules, or global variables associated with the
report. Although PowerAnalyzer does not export global variables, it lists all global
variables it finds in the report filter. You can export these global variables separately.
Steps:
1. Login to PowerAnalyzer as a System Administrator
2. Click Administration > XML Export/Import > Export Reports
3. Select the folder or report to be exported
4. Click Export as XML option
5. Enter XML filename and click Save option to save the XML file
6. The XML file will be stored locally on the client machine

EXPORTING GLOBAL VARIABLE(S):
Steps:
1. Login to PowerAnalyzer as a System Administrator
2. Click Administration > XML Export/Import > Export Global Variables
3. Select the Global variable to be exported
4. Click Export as XML option
5. Enter XML filename and click Save option to save the XML file
6. The XML file will be stored locally on the client machine

EXPORTING A DASHBOARD:
When a dashboard is exported, PowerAnalyzer exports Reports, Indicators, Shared
Documents, and Gauges associated with the dashboard. PowerAnalyzer does not
export Alerts, Access Permissions, Attributes and Metrics in the Report(s), or Real-time
Objects. You can export any of the public dashboards defined in the repository and you
can export more than one dashboard at one time.
Steps:
1. Login to PowerAnalyzer as a System Administrator
2. Click Administration > XML Export/Import > Export Dashboards
3. Select the Dashboard to be exported
4. Click Export as XML option
5. Enter XML filename and click Save option to save the XML file
6. The XML file will be stored locally on the client machine

EXPORTING A USER SECURITY PROFILE:
PowerAnalyzer keeps a security profile for each user or group in the repository. A
security profile consists of the access permissions and data restrictions that the system
administrator sets for a user or group.

When exporting a security profile, PowerAnalyzer exports access permissions for
objects under the Schema Directory, which include folders, metrics, and attributes.
PowerAnalyzer does not export access permissions for filtersets, reports, or shared
documents.
PowerAnalyzer allows you to export only one security profile at a time. If a user or
group security profile you export does not have any access permissions or data
restrictions, PowerAnalyzer does not export any object definitions and displays the
following message:
There is no content to be exported.

Steps:
1. Login to PowerAnalyzer as a System Administrator
2. Click Administration > XML Export/Import > Export Security Profile
3. Click Export from Users and select the user whose security profile is to be
exported
4. Click Export as XML option
5. Enter XML filename and click Save option to save the XML file
6. The XML file will be stored locally on the client machine

EXPORTING A SCHEDULE:
You can export a time-based or event-based schedule to an XML file. PowerAnalyzer
runs a report with a time-based schedule on a configured schedule. PowerAnalyzer runs
a report with an event-based schedule when a PowerCenter session completes. When
you export a schedule, PowerAnalyzer does not export the history of the schedule.
Steps:
1. Login to PowerAnalyzer as a System Administrator
2. Click Administration > XML Export/Import > Export Schedules
3. Select the Schedule to be exported
4. Click Export as XML option
5. Enter XML filename and click Save option to save the XML file
6. The XML file will be stored locally on the client machine

EXPORTING A USER/GROUP/ROLE:

Exporting Users

You can export the definition of any user you define in the repository. However, you
cannot export the definitions of system users defined by PowerAnalyzer. If you have
over a thousand users defined in the repository, PowerAnalyzer allows you to search for
the users that you want to export. You can use the asterisk (*) or the percent symbol
(%) as wildcard characters to search for users to export.
You can export the definitions of more than one user, including the following
information:
Login name
Description
First, middle, and last name
Title
Password
Change password privilege
Password never expires indicator
Account status
Groups to which the user belongs
Roles assigned to the user
Query governing settings
PowerAnalyzer does not export the email address, reply-to address, department, or
color scheme assignment associated with the exported user.
Steps:
1. Login to PowerAnalyzer as a System Administrator
2. Click Administration > XML Export/Import > Export User/Group/Role
3. Click Export Users/Group(s)/Role(s) option
4. Select the user to be exported
5. Click Export as XML option
6. Enter XML filename and click Save option to save the XML file
7. The XML file will be stored locally on the client machine

Exporting Groups

You can export any group defined in the repository, and you can export the definitions
of more than one group at a time. You can also export the definitions of all the users
within a selected group. You can use the asterisk (*) or the percent symbol (%) as
wildcard characters to search for groups to export. Each group definition includes the
following information:
Name
Description
Department
Color scheme assignment
Group hierarchy
Roles assigned to the group
Users assigned to the group
Query governing settings
PowerAnalyzer does not export the color scheme associated with an exported group.

Steps:
1. Login to PowerAnalyzer as a System Administrator
2. Click Administration > XML Export/Import > Export User/Group/Role
3. Click Export Users/Group(s)/Role(s) option
4. Select the group to be exported
5. Click Export as XML option
6. Enter XML filename and click Save option to save the XML file
7. The XML file will be stored locally on the client machine

Exporting Roles

You can export the definitions of the custom roles that you define in the repository. You
cannot export the definitions of system roles defined by PowerAnalyzer. You can export
the definitions of more than one role. Each role definition includes the name and
description of the role and the permissions assigned to each role.
Steps:
1. Login to PowerAnalyzer as a System Administrator
2. Click Administration > XML Export/Import > Export User/Group/Role
3. Click Export Users/Group(s)/Role(s) option
4. Select the role to be exported
5. Click Export as XML option
6. Enter XML filename and click Save option to save the XML file
7. The XML file will be stored locally on the client machine

IMPORTING OBJECTS
You can import objects into the same repository or a different repository. If you import
objects that already exist in the repository, you can choose to overwrite the existing
objects. However, you can import only global variables that do not already exist in the
repository.
When you import objects, you can validate the XML file against the DTD provided by
PowerAnalyzer. Informatica recommends that you do not modify the XML files after you
export from PowerAnalyzer. Ordinarily, you do not need to validate an XML file that you
create by exporting from PowerAnalyzer. However, if you are not sure of the validity of
an XML file, you can validate it against the PowerAnalyzer DTD file when you start the
import process.
To import repository objects, you must have the System Administrator role or the
Access XML Export/Import privilege.
When you import a repository object, you become the owner of the object as if you
created it. However, other system administrators can also access imported repository
objects. You can limit access to reports for users who are not system administrators. If
you select to publish imported reports to everyone, all users in PowerAnalyzer have
read and write access to them. You can change the access permissions to reports after
you import them.

IMPORTING SCHEMAS
When importing schemas, if the XML file contains only the metric definition, you must
make sure that the fact table for the metric exists in the target repository. You can
import a metric only if its associated fact table exists in the target repository or the
definition of its associated fact table is also in the XML file.
When you import a schema, PowerAnalyzer displays a list of all the definitions
contained in the XML file. It then displays a list of all the object definitions in the XML
file that already exist in the repository. You can choose to overwrite objects in the
repository. If you import a schema that contains time keys, you must import or create
a time dimension.
Steps:
1. Login to PowerAnalyzer as a System Administrator
2. Click Administration > XML Export/Import > Import Schema
3. Click Browse to choose an XML file to import
4. Select the Validate XML against DTD option
5. Click Import XML option
6. Verify all attributes on the summary page, and choose Continue

IMPORTING REPORTS
A valid XML file of exported report objects can contain definitions of cached or on-
demand reports, including prompted reports. When you import a report, you must
make sure that all the metrics and attributes used in the report are defined in the
target repository. If you import a report that contains attributes and metrics not
defined in the target repository, you can cancel the import process. If you choose to
continue the import process, you might not be able to run the report correctly. To run
the report, you must import or add the attribute and metric definitions to the target
repository.
You are the owner of all the reports you import, including the personal or public
indicators associated with the reports. You can publish the imported reports to all
PowerAnalyzer users. If you publish reports to everyone, PowerAnalyzer provides read
access to the reports to all users. However, it does not provide access to the folder that
contains the imported reports. If you want another user to access an imported report,
you can put the imported report in a public folder and have the user save or move the
imported report to the user's personal folder. Any public indicator associated with the
report also becomes accessible to the user.
If you import a report and its corresponding analytic workflow, the XML file contains all
workflow reports. If you choose to overwrite the report, PowerAnalyzer also overwrites
the workflow reports. Also, when importing multiple workflows, note that
PowerAnalyzer does not import analytic workflows containing the same workflow report
names. Thus, ensure that all imported analytic workflows have unique report names
prior to being imported.
Steps:
1. Login to PowerAnalyzer as a System Administrator
2. Click Administration > XML Export/Import > Import Report
3. Click Browse to choose an XML file to import
4. Select the Validate XML against DTD option
5. Click Import XML option
6. Verify all attributes on the summary page, and choose Continue

IMPORTING GLOBAL VARIABLES

You can import global variables that are not defined in the target repository. If the XML
file contains global variables already in the repository, you can cancel the process. If
you continue the import process, PowerAnalyzer imports only the global variables not in
the target repository.
Steps:
1. Login to PowerAnalyzer as a System Administrator
2. Click Administration > XML Export/Import > Import Global Variables
3. Click Browse to choose an XML file to import
4. Select the Validate XML against DTD option
5. Click Import XML option
6. Verify all attributes on the summary page, and choose Continue

IMPORTING DASHBOARDS

Dashboards display links to reports, shared documents, alerts, and indicators. When
you import a dashboard, PowerAnalyzer imports the following objects associated with
the dashboard:
Reports
Indicators
Shared documents
Gauges
PowerAnalyzer does not import the following objects associated with the dashboard:
Alerts
Access permissions
Attributes and metrics in the report
Real-time objects

If an object already exists in the repository, PowerAnalyzer provides an option to
overwrite the object. PowerAnalyzer does not import the attributes and metrics in the
reports associated with the dashboard. If the attributes or metrics in a report
associated with the dashboard do not exist, the report does not display on the imported
dashboard.
Steps:
1. Login to PowerAnalyzer as a System Administrator
2. Click Administration > XML Export/Import > Import Dashboard
3. Click Browse to choose an XML file to import
4. Select the Validate XML against DTD option
5. Click Import XML option
6. Verify all attributes on the summary page, and choose Continue

IMPORTING SECURITY PROFILE(S):

When you import a security profile, you must first select the user or group to which you
want to assign the security profile. You can assign the same security profile to more
than one user or group.
When you import a security profile and associate it with a user or group, you can either
overwrite the current security profile or add to it. When you overwrite a security profile,
you assign the user or group only the access permissions and data restrictions found in
the new security profile. PowerAnalyzer removes the old restrictions associated with the
user or group. When you append a security profile, you assign the user or group the
new access permissions and data restrictions in addition to the old permissions and
restrictions.
When exporting a security profile, PowerAnalyzer exports the security profile for objects
in Schema Directory, including folders, attributes, and metrics. However, it does not
include the security profile for filtersets.
Steps:
1. Login to PowerAnalyzer as a System Administrator
2. Click Administration > XML Export/Import > Import Security Profile
3. Click Import to Users
4. Select the user with which you want to associate the security profile you import.
o To associate the imported security profiles with all the users in the page,
select the check box under Users at the top of the list.
o To associate the imported security profiles with all the users in the
repository, select Import to All.
o To overwrite the selected user's current security profile with the imported
security profile, select Overwrite.
o To append the imported security profile to the selected user's current
security profile, select Append.

5. Click Browse to choose an XML file to import
6. Select the Validate XML against DTD option
7. Click Import XML option
8. Verify all attributes on the summary page, and choose Continue

IMPORTING SCHEDULE(S):

A time-based schedule runs reports based on a configured schedule. An event-based
schedule runs reports when a PowerCenter session completes. You can import time-based
or event-based schedules from an XML file. When you import a schedule,
PowerAnalyzer does not attach the schedule to any reports.
Steps:
1. Login to PowerAnalyzer as a System Administrator
2. Click Administration > XML Export/Import > Import Schedule
3. Click Browse to choose an XML file to import
4. Select the Validate XML against DTD option
5. Click Import XML option
6. Verify all attributes on the summary page, and choose Continue

IMPORTING USER(S)/GROUP(S)/ROLE(S):

When you import a user, group, or role, you import all the information associated with
each user, group, or role. The XML file includes definitions of roles assigned to users or
groups, and definitions of users within groups. For this reason, you can import the
definition of a user, group, or role in the same import process.
When you import a user, you import the definitions of roles assigned to the user and
the groups to which the user belongs. When you import a user or group, you import
the user or group definitions only. The XML file does not contain the color scheme
assignments, access permissions, or data restrictions for the user or group. To import
the access permissions and data restrictions, you must import the security profile for
the user or group.
Steps:
1. Login to PowerAnalyzer as a System Administrator
2. Click Administration > XML Export/Import > Import User/Group/Role
3. Click Browse to choose an XML file to import
4. Select the Validate XML against DTD option
5. Click Import XML option
6. Verify all attributes on the summary page, and choose Continue

Tips for Importing/Exporting

Schedule importing/exporting of repository objects at a time of minimal PowerAnalyzer
activity, when most users are not accessing the PowerAnalyzer repository. This reduces
the likelihood of users experiencing timeout errors or degraded response time. Only the
System Administrator should perform the export/import operation.
Take a backup of the PowerAnalyzer repository before performing the
export/import operation. This backup should be completed using the Repository
Backup Utility provided with PowerAnalyzer.
Manually add user/group permissions for the report. They will not be exported as
part of exporting Reports and should be manually added after the report is imported in
the desired server.
Use a version control tool. Prior to importing objects into a new environment, it is
advisable to check the XML documents into a version control tool such as Microsoft
Visual Source Safe, or PVCS. This will facilitate the versioning of repository objects and
provide a means to rollback to a prior version of an object, if necessary.
PowerAnalyzer does not import the schedule with a cached report. When you import
cached reports, you must attach them to schedules in the target repository. You can
attach multiple imported reports to schedules in the target repository in one process
immediately after you import them.
If you import a report that uses global variables in the attribute filter, ensure that the
global variables already exist in the target repository. If they are not in the target
repository, you must either import the global variables from the source repository or
recreate them in the target repository.
You must add indicators to the dashboard manually. When you import a
dashboard, PowerAnalyzer imports all indicators for the originating report and workflow
reports in a workflow. However, indicators for workflow reports do not display on the
imported dashboard until you add them manually.
Check with your system administrator to understand what level of LDAP
integration has been configured, if any. Users, groups, and roles will need to be
exported and imported during deployment when using repository authentication. If
PowerAnalyzer has been integrated with an LDAP (Lightweight Directory Access
Protocol) tool, then users, groups, and/or roles may not require deployment.
When you import users into a Microsoft SQL Server or IBM DB2 repository,
PowerAnalyzer blocks all user authentication requests until the import process is
complete.



Installing PowerAnalyzer
Challenge
Installing PowerAnalyzer on new or existing hardware, either as a dedicated application
on a physical machine (as Informatica recommends) or co-existing with other
applications on the same physical server or with other Web applications on the same
application server.
Description
Consider the following questions when determining what type of hardware to use for
PowerAnalyzer:
If the hardware already exists:
1. Is the processor, operating system, and database software supported by
PowerAnalyzer?
2. Are the necessary operating system and database patches applied?
3. How many CPUs does the machine currently have? Can the CPU capacity be
expanded?
4. How much memory does the machine have? How much is available to the
PowerAnalyzer application?
5. Will PowerAnalyzer run alone or share the machine with other applications? If
yes, what are the CPU and memory requirements of the other applications?
If the hardware does not already exist:
1. Has the organization standardized on hardware or operating system vendor?
2. What type of operating system is preferred and supported (e.g., Solaris 9,
Windows 2003, AIX 5.2, HP-UX 11i, Red Hat AS 3.0, SuSE 8)?
3. What database and version is preferred and supported for the PowerAnalyzer
repository?
Regardless of the hardware vendor chosen, the hardware must be configured and sized
appropriately to support the reporting response time requirements for PowerAnalyzer.
The following questions should be answered in order to estimate the size of a
PowerAnalyzer server:
1. How many users are predicted for concurrent access?

2. On average, how many rows will be returned in each report?
3. On average, how many charts will there be for each report?
4. Do the business requirements mandate a SSL Web server?
The hardware requirements for the PowerAnalyzer environment depend on the number
of concurrent users, types of reports being used (interactive vs. static), average
number of records in a report, application server and operating system used, among
other factors. The following tables should be used as a general guide for hardware
recommendations for a PowerAnalyzer installation. Actual results may vary depending
upon exact hardware configuration and user volume. For exact sizing
recommendations, please contact Informatica Professional Services for a PowerAnalyzer
Sizing and Baseline Architecture engagement.
Windows 2000
# of Concurrent Users | Avg. Rows per Report | Avg. # of Charts per Report | Est. # of CPUs for Peak Usage | Est. Total RAM (PowerAnalyzer alone) | Est. # of App Servers (Clustered)
50  | 1000  | 2  | 2     | 1 GB   | 1
100 | 1000  | 2  | 3     | 2 GB   | 1 - 2
200 | 1000  | 2  | 6     | 3.5 GB | 3
400 | 1000  | 2  | 12    | 6.5 GB | 6
100 | 1000  | 2  | 3     | 2 GB   | 1 - 2
100 | 2000  | 2  | 3     | 2.5 GB | 1 - 2
100 | 5000  | 2  | 4     | 3 GB   | 2
100 | 10000 | 2  | 5     | 4 GB   | 2 - 3
100 | 1000  | 2  | 3     | 2 GB   | 1 - 2
100 | 1000  | 5  | 3     | 2 GB   | 1 - 2
100 | 1000  | 7  | 3     | 2.5 GB | 1 - 2
100 | 1000  | 10 | 3 - 4 | 3 GB   | 1 - 2
Notes:
1. This estimating guide is based on certain experiments conducted in the
Informatica lab.
2. The sizing estimates are based on PowerAnalyzer 5 running BEA WebLogic 8.1
SP3 and Windows 2000 on a 4-CPU 2.5 GHz Xeon processor. This estimate may not
be accurate for other environments.
3. The number of concurrent users under peak volume can be estimated by using
the number of total users multiplied by the percentage of concurrent users. In
practice, typically 10% of the user base is concurrent. However, this percentage
can be as high as 50% or as low as 5% in some organizations.
4. For every 2 CPUs on the server, Informatica recommends 1 managed server
(instance) of the application server. For servers with at least four CPUs,
clustering multiple logical instances of the application server on one physical
server can result in increased performance.
5. Add 30 to 50 percent overhead for an SSL Web server architecture, depending on
the strength of encryption.
6. CPU utilization can be reduced by 10 to 25 percent by using SVG charts (interactive
charting) rather than the default PNG charting.
7. Clustering is recommended for instances with more than 50 concurrent users.
(Clustering does not have to span multiple machines if the server has four or more CPUs.)
8. Informatica Professional Services should be engaged for a thorough and
accurate sizing estimate.
IBM AIX 5.2
# of Concurrent Users | Avg. Rows per Report | Avg. # of Charts per Report | Est. # of CPUs for Peak Usage | Est. Total RAM (PowerAnalyzer alone) | Est. # of App Servers (Clustered)
50  | 1000  | 2  | 2      | 1 GB   | 1
100 | 1000  | 2  | 2 - 3  | 2 GB   | 1
200 | 1000  | 2  | 4 - 5  | 3.5 GB | 2 - 3
400 | 1000  | 2  | 9 - 10 | 6 GB   | 4 - 5
100 | 1000  | 2  | 2 - 3  | 2 GB   | 1
100 | 2000  | 2  | 2 - 3  | 2 GB   | 1 - 2
100 | 5000  | 2  | 2 - 3  | 3 GB   | 1 - 2
100 | 10000 | 2  | 4      | 4 GB   | 2
100 | 1000  | 2  | 2 - 3  | 2 GB   | 1
100 | 1000  | 5  | 2 - 3  | 2 GB   | 1
100 | 1000  | 7  | 2 - 3  | 2 GB   | 1 - 2
100 | 1000  | 10 | 2 - 3  | 2.5 GB | 1 - 2
Notes:
1. This estimating guide is based on certain experiments conducted in the
Informatica lab.
2. The sizing estimates are based on PowerAnalyzer 5 running IBM WebSphere
5.1.1.1 and AIX 5.2.02 on a 4-CPU 2.4 GHz IBM p630. This estimate may not be
accurate for other environments.
3. The number of concurrent users under peak volume can be estimated by using
the number of total users multiplied by the percentage of concurrent users. In
practice, typically 10% of the user base is concurrent. However, this percentage
can be as high as 50% or as low as 5% in some organizations.
4. For every 2 CPUs on the server, Informatica recommends 1 managed server
(instance) of the application server. For servers with at least four CPUs,
clustering multiple logical instances of the application server on one physical
server can result in increased performance.
5. Add 30 to 50 percent overhead for an SSL Web server architecture, depending on
the strength of encryption.
6. CPU utilization can be reduced by 10 to 25 percent by using SVG charts (interactive
charting) rather than the default PNG charting.
7. Clustering is recommended for instances with more than 50 concurrent users.
(Clustering does not have to span multiple machines if the server has four or more CPUs.)
8. Informatica Professional Services should be engaged for a thorough and
accurate sizing estimate.

PowerAnalyzer Installation
There are two main components of the PowerAnalyzer installation process: the
PowerAnalyzer Repository and the PowerAnalyzer Server, which is an application
deployed on an application server. A Web server is necessary to support these
components and is included with the installation of the application servers. This section
discusses the installation process for BEA WebLogic and IBM WebSphere. The
installation tips apply to both Windows and UNIX environments. This section is intended
to serve as a supplement to the PowerAnalyzer Installation Guide.
Before installing PowerAnalyzer, please complete the following steps:
Verify that the hardware meets the minimum system requirements for PowerAnalyzer.
Ensure that the combination of hardware, operating system, application server,
repository database, and, optionally, authentication software is supported by
PowerAnalyzer.
Ensure that sufficient space has been allocated to the PowerAnalyzer repository.
Apply all necessary patches to the operating system and database software.
Verify connectivity to the data warehouse database (or other reporting source) and
repository database.
If LDAP or NT Domain is used for PowerAnalyzer authentication, verify connectivity
to the LDAP directory server or the NT primary domain controller.
Obtain the PowerAnalyzer license file from productrequests@informatica.com.
On UNIX or Linux installations, the OS user installing PowerAnalyzer must have
execute permission on all PowerAnalyzer installation executables, as shown in the
example below.
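For illustration only, the following commands grant execute permission on the installer
executables. The directory and file names shown are placeholders; substitute the actual
location and names of your installation files:
# Hypothetical staging directory for the PowerAnalyzer installation files
cd /opt/stage/powera
# Grant the installing OS user execute permission on the installer executables
chmod u+x *.bin *.sh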
In addition to the standard PowerAnalyzer components that are installed by default,
other components of PowerAnalyzer that can be installed include:
PowerCenter Integration Utility
PowerAnalyzer SDK
PowerAnalyzer Portal Integration Kit
PowerAnalyzer Metadata Reporter (PAMR)
Please see the PowerAnalyzer documentation for more detailed installation instructions
for these components.
Installation Steps - BEA WebLogic
The following are the basic installation steps for PowerAnalyzer on BEA WebLogic:
1. Set up the PowerAnalyzer repository database. The PowerAnalyzer Server
installation process creates the repository tables, but an empty database
schema must exist and be reachable via JDBC prior to installation.
2. Install BEA WebLogic and apply the BEA license.
3. Install PowerAnalyzer.
4. Apply the PowerAnalyzer license key.
5. Install the PowerAnalyzer Online Help.


TIP
When creating a repository in an Oracle database, make sure the storage
parameters specified for the tablespace that contains the repository are not set too
large. Since many target tablespaces are initially set for very large INITIAL and NEXT
values, large storage parameters cause the repository to use excessive amounts of
space. Also verify that the default tablespace for the user that owns the repository
tables is set correctly.
The following example shows how to set the recommended storage parameters,
assuming the repository is stored in the REPOSITORY tablespace:
ALTER TABLESPACE REPOSITORY DEFAULT STORAGE ( INITIAL 10K NEXT 10K
MAXEXTENTS UNLIMITED PCTINCREASE 50 );

Installation Tips - BEA WebLogic
The following are the basic installation tips for PowerAnalyzer on BEA WebLogic:
Beginning with PowerAnalyzer 5, multiple PowerAnalyzer instances can be installed
on a single instance of WebLogic. Also, other applications can co-exist with
PowerAnalyzer on a single instance of WebLogic. Although this architecture
should be factored in during hardware sizing estimates, it allows greater
flexibility during installation.
For WebLogic installations on UNIX, the BEA WebLogic Server installation program
requires an X-Windows server. If BEA WebLogic Server is installed on a machine
where an X-Windows server is not installed, an X-Windows server must be
installed on another machine in order to render graphics for the GUI-based
installation program. For more information on installing on UNIX, please see the
UNIX Servers section of the installation and configuration tips below.
If the PowerAnalyzer installation files are transferred to the PowerAnalyzer Server,
they must be transferred via FTP in binary mode (see the example after these tips).
To view additional debugging information during UNIX installations, the
LAX_DEBUG environment variable can be set to true (LAX_DEBUG=true).
During the PowerAnalyzer installation process, the user will be prompted to choose
an authentication method for PowerAnalyzer, such as repository, NT Domain, or
LDAP. If LDAP or NT Domain authentication is used, it is best to have the
configuration parameters available during installation as the installer will
configure all properties files at installation.
The PowerAnalyzer license file and BEA WebLogic license must be applied prior to
starting PowerAnalyzer.
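As a simple illustration, a command-line FTP session that transfers the installation files
in binary mode might look like the following (the host and file names are placeholders):
ftp powera-host.example.com
# At the ftp> prompt, after logging in:
#   binary                   (switch to binary/image transfer mode)
#   put powera_install.bin   (transfer the installation file)
#   bye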
Installation Steps - IBM WebSphere
The following are the basic installation steps for PowerAnalyzer on IBM WebSphere:
1. Set up the PowerAnalyzer repository database. The PowerAnalyzer Server
installation process creates the repository tables, but an empty database
schema must exist and be reachable via JDBC prior to installation.
2. Install IBM WebSphere and apply all WebSphere patches. WebSphere can be
installed in its Base configuration, or in its Network Deployment configuration if
clustering will be used. In both cases, patch sets must be applied.
3. Install PowerAnalyzer.
4. Apply the PowerAnalyzer license key.
5. Install the PowerAnalyzer Online Help.
Installation Tips - IBM WebSphere

Starting in PowerAnalyzer 5, multiple PowerAnalyzer instances can be installed on
a single instance of WebSphere. Also, other applications can co-exist with
PowerAnalyzer on a single instance of WebSphere. Although this architecture
should be considered during sizing estimates, it allows greater flexibility during
installation.
For WebSphere installations on UNIX, the IBM WebSphere installation program
requires an X-Windows server. If IBM WebSphere is installed on a machine
where an X-Windows server is not installed, an X-Windows server must be
installed on another machine in order to render graphics for the GUI based
installation program. For more information on installing on UNIX, please see the
UNIX Servers section of the installation and configuration tips below.
For WebSphere on UNIX installations, PowerAnalyzer must be installed using the
root user or system administrator account. Two groups (mqm and mqbrkrs)
must be created prior to the installation, and the root account should be added to
both of these groups (a sample command sequence appears after these tips).
For WebSphere on Windows installations, ensure that PowerAnalyzer is installed
under the padaemon local Windows user ID that is in the Administrative group
and has the advanced user rights "Act as part of the operating system" and "Log
on as a service." During the installation, the padaemon account will need to be
added to the mqm group.
If the PowerAnalyzer installation files are transferred to the PowerAnalyzer Server,
they must be transferred via FTP in binary mode.
To view additional debugging information during UNIX installations, the
LAX_DEBUG environment variable can be set to true (LAX_DEBUG=true).
During the WebSphere installation process, the user will be prompted to enter a
directory for the application server and the HTTP (web) server. In both
instances, it is best to keep the default installation directory. Directory names
for the application server and HTTP server that include spaces may result in
errors.
During the PowerAnalyzer installation process, the user will be prompted to choose
an authentication method for PowerAnalyzer, such as repository, NT Domain, or
LDAP. If LDAP or NT Domain authentication is utilized, it is best to have the
configuration parameters available during installation as the installer will
configure all properties files at installation.
The PowerAnalyzer license file (and any application server license, if required) must
be applied prior to starting PowerAnalyzer.
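The group setup described in the tips above can be scripted. The following sketch uses
Linux commands; group and user administration commands differ on AIX and other
platforms, where utilities such as mkgroup and chuser are used instead:
# Create the groups required by the WebSphere embedded messaging components
groupadd mqm
groupadd mqbrkrs
# Add root to both groups (-G sets the supplementary group list;
# include any groups root already belongs to)
usermod -G mqm,mqbrkrs root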
Installation and Configuration Tips - UNIX Servers

A graphics display server is required for a PowerAnalyzer installation on UNIX. On UNIX,
the graphics display server is typically an X-Windows server, although an X-Window
Virtual Frame Buffer (XVFB) or personal computer X-Windows software such as WRQ
Reflection-X can also be used. In any case, the X-Windows server does not need to
exist on the local machine where PowerAnalyzer is being installed, but does need to be
accessible. A remote X-Windows, XVFB, or PC-X Server can be used by setting the
DISPLAY to the appropriate IP address, as discussed below.
If the X-Windows server is not installed on the machine where PowerAnalyzer will be
installed, PowerAnalyzer can be installed using an X-Windows server installed on
another machine. Simply redirect the DISPLAY variable to use the X-Windows server on
another UNIX machine.
To redirect the host output, define the environment variable DISPLAY. On the command
line, type the following command and press Enter:
C shell:
setenv DISPLAY <TCP/IP node of X-Windows server>:0
Bourne/Korn shell:
export DISPLAY=<TCP/IP node of X-Windows server>:0
Configuration

PowerAnalyzer requires a means to render graphics for charting and indicators.
When graphics rendering is not configured properly, charts and indicators will
not be displayed properly on dashboards or reports. For PowerAnalyzer
installations using an application server with JDK 1.4 and greater, the
java.awt.headless=true setting can be set in the application server startup
scripts to facilitate graphics rendering for PowerAnalyzer. If the application
server does not use JDK 1.4 or later, use an X-Windows server or XVFB to
render graphics. The DISPLAY environment variable should be set to the IP
address of the X-Windows or XVFB server prior to starting PowerAnalyzer. (A
sample startup-script configuration follows these items.)
The application server heap size is the memory allocation for the JVM. The
recommended heap size greatly depends on the memory available on the
machine hosting the application server and server load, but the recommended
starting point is 512MB. This setting is the first setting that should be examined
when tuning a PowerAnalyzer instance.
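The following sketch shows how these settings are commonly applied in a WebLogic
startup script (startWebLogic.sh). The variable names JAVA_OPTIONS and MEM_ARGS follow
typical WebLogic conventions; verify the exact script and variable names used by your
application server and version, and treat the values as starting points only:
# Headless graphics rendering for JDK 1.4 or later
JAVA_OPTIONS="$JAVA_OPTIONS -Djava.awt.headless=true"
# Starting heap allocation; increase after load testing if needed
MEM_ARGS="-Xms512m -Xmx512m"
export JAVA_OPTIONS MEM_ARGS

# If an X-Windows or XVFB server is used for rendering instead, point DISPLAY
# at it before starting PowerAnalyzer (the address below is a placeholder)
DISPLAY=192.168.1.10:0
export DISPLAY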


PowerAnalyzer Security
Challenge
Using PowerAnalyzer's security architecture to establish a robust security system that
safeguards valuable business information across a full range of technologies and
security models. Ensuring that PowerAnalyzer security provides appropriate
mechanisms to support and augment the security infrastructure of a Business
Intelligence environment at every level.



Description
Four main architectural layers must be completely secure: user layer, transmission
layer, application layer and data layer.

User layer
Users must be authenticated and authorized to access data. PowerAnalyzer integrates
with the following LDAP-compliant directory servers:
SunOne/iPlanet Directory Server 4.1
Sun Java System Directory Server 5.2
Novell eDirectory Server 8.7
IBM SecureWay Directory 3.2
IBM SecureWay Directory 4.1
IBM Tivoli Directory Server 5.2
Microsoft Active Directory 2000
Microsoft Active Directory 2003
In addition to the directory server, PowerAnalyzer supports Netegrity SiteMinder for
centralizing authentication and access control for the various web applications in the
organization.
Transmission layer
Data transmission must be protected against interception and tampering. PowerAnalyzer
supports the standard Secure Sockets Layer (SSL) security protocol to provide a secure
environment.

Application layer
Only appropriate application functionality should be provided to users with associated
privileges. PowerAnalyzer provides three basic types of application-level security:
Report, Folder & Dashboard Security restricts users and groups to specific
reports or folders and dashboards that they can access.
Column-level Security restricts users and groups to particular metric and
attribute columns.
Row-level Security restricts users to specific attribute values within an
attribute column of a table.
Components for Managing Application Layer Security
PowerAnalyzer users can perform different tasks based on the privileges that you grant
them. PowerAnalyzer provides the following components for managing application layer
security:
Roles: A role can consist of one or more privileges. You can use system roles or
create custom roles. You can grant roles to groups and/or individual users.
When you edit a custom role, all groups and users with the role automatically
inherit the change.
Groups: A group can consist of users and/or groups. You can assign one or more
roles to a group. Groups are created to organize logical sets of users and roles.
After you create groups, you can assign users to the groups. You can also assign
groups to other groups to organize privileges for related users. When you edit a
group, all users and groups within the edited group inherit the change.
Users: A user has a user name and password. Each person accessing
PowerAnalyzer must have a unique user name. To set the tasks a user can
perform, you can assign roles to the user or assign the user to a group with
predefined roles.
Types of Roles
System roles
PowerAnalyzer provides a set of predefined system roles when the repository is
created. Each role has a set of privileges assigned to it.
Custom roles
The end user can create and assign privileges to these roles.
Managing Groups
Groups allow you to classify users according to a particular function. You may organize
users into groups based on their departments or management level. When you assign
roles to a group, you grant the same privileges to all members of the group. When you
change the roles assigned to a group, all users in the group inherit the changes. If a
user belongs to more than one group, the user has the privileges from all groups. To
organize related users into related groups, you can create group hierarchies. With
hierarchical groups, each subgroup automatically receives the roles assigned to the
group it belongs to. When you edit a group, all subgroups contained within it inherit the
changes.
For example, you may create a Lead group and assign it the Advanced Consumer role.
Within the Lead group, you create a Manager group with a custom role Manage
PowerAnalyzer. Because the Manager group is a subgroup of the Lead group, it has
both the Manage PowerAnalyzer and Advanced Consumer role privileges.



Belonging to multiple groups has an inclusive effect. For example, if group 1 has
access to an object but group 2 is excluded from that object, a user belonging to both
groups 1 and 2 will have access to the object.

Managing Users
Each user must have a unique user name to access PowerAnalyzer. To perform
PowerAnalyzer tasks, a user must have the appropriate privileges. You can assign
privileges to a user with roles or groups.
PowerAnalyzer creates a system administrator user account when you create the
repository. The default user name for the system administrator user account is admin.
The system daemon, ias_scheduler, runs the updates for all time-based schedules.
System daemons must have a unique user name and password in order to perform
PowerAnalyzer system functions and tasks. You can change the password for a system
daemon, but you cannot change the system daemon user name via the GUI.
PowerAnalyzer permanently assigns the Daemon role to system daemons. You cannot
assign new roles to system daemons or assign them to groups.
To change the password for a system daemon, you must complete the following steps:
1. Change the password in the Administration tab in PowerAnalyzer.
2. Change the password in the web.xml file in the PowerAnalyzer folder.
3. Restart PowerAnalyzer.


Customizing User Access
You can customize PowerAnalyzer user access with the following security options:
Access permissions: Restrict user and/or group access to folders, reports,
dashboards, attributes, metrics, template dimensions, or schedules. Use access
permissions to restrict access to a particular folder or object in the repository.
Data restrictions: Restrict user and/or group access to information in fact and
dimension tables and operational schemas. Use data restrictions to prevent
certain users or groups from accessing specific values when they create reports.
Password restrictions: Restrict users from changing their passwords. Use
password restrictions when you do not want users to alter their passwords.

When you create an object in the repository, every user has default read and write
permissions on that object. By customizing access permissions on an object, you
determine which users and/or groups can read, write, delete, or change access
permissions on that object.
When you set data restrictions, you determine which users and groups can view
particular attribute values. If a user with a data restriction runs a report, PowerAnalyzer
does not display the restricted data to that user.
Types of Access Permissions
Access permissions determine the tasks you can perform for a specific repository
object. When you set access permissions, you determine which users and groups have
access to the folders and repository objects. You can assign the following types of
access permissions to repository objects:
Read: Allows you to view a folder or object.
Write: Allows you to edit an object. Also allows you to create and edit folders and
objects within a folder.
Delete: Allows you to delete a folder or an object from the repository.
Change permission: Allows you to change the access permissions on a folder or
object.
By default, PowerAnalyzer grants read and write access permissions to every user in
the repository. You can use the General Permissions area to modify default access
permissions for an object, or turn off default access permissions.
Data Restrictions
You can restrict access to data based on the values of related attributes. Data
restrictions are set to keep sensitive data from appearing in reports. For example, you
may want to hide data related to the performance of a new store from outside vendors;
you can set a data restriction that excludes the store ID from their reports.
You can set data restrictions using one of the following methods:
Set data restrictions by object. Restrict access to attribute values in a fact
table, operational schema, real-time connector, and real-time message stream.
You can apply the data restriction to users and groups in the repository. Use this
method to apply the same data restrictions to more than one user or group.
Set data restrictions for one user at a time. Edit a user account or group to
restrict user or group access to specified data. You can set one or more data
restrictions for each user or group. Use this method to set custom data
restrictions for different users or groups.
Two Types of Data Restrictions
You can set two kinds of data restrictions:

Inclusive: Use the IN option to allow users to access data related to the attributes
you select. For example, to allow users to view only data from the year 2001,
create an IN 2001 rule.
Exclusive: Use the NOT IN option to restrict users from accessing data related to
the attributes you select. For example, to allow users to view all data except
from the year 2001, create a NOT IN 2001 rule.
Restricting Data Access by User or Group
You can edit a user or group profile to restrict the data the user or group can access in
reports. When you edit a user profile, you can set data restrictions for any schema in
the repository, including operational schemas and fact tables.
You can set a data restriction to limit user or group access to data in a single schema
based on the attributes you select. If the attributes apply to more than one schema in
the repository, you can also restrict the user or group access from related data across
all schemas in the repository. For example, you have a Sales fact table and Salary fact
table. Both tables use the Region attribute. You can set one data restriction that applies
to both the sales and salary fact tables based on the region you select.
To set data restrictions for a user or group, you need the following role or privilege:
System Administrator role
Access Management privilege
When PowerAnalyzer runs scheduled reports that have provider-based security, it runs
reports against the data restrictions for the report owner. However, if the reports have
consumer-based security then the PowerAnalyzer Server will create a separate report
for each unique security profile.
The following steps apply to changing the default PowerAnalyzer users on WebLogic only.
To change the PowerAnalyzer default users (admin and ias_scheduler):
1. Back up the repository.
2. Go to the WebLogic library directory: .\bea\wlserver6.1\lib
3. Open the file ias.jar and locate the file entry called
InfChangeSystemUserNames.class
4. Extract the file "InfChangeSystemUserNames.class" into a temporary directory
(example: d:\temp).
5. This extracts the file as
'd:\temp\Repository Utils\Refresh\InfChangeSystemUserNames.class'.
6. Create a batch file (change_sys_user.bat) with the following commands in the
directory D:\Temp\Repository Utils\Refresh\
REM To change the system user name and password
REM *******************************************
REM Change the BEA home here
REM ************************
set JAVA_HOME=E:\bea\wlserver6.1\jdk131_06
set WL_HOME=E:\bea\wlserver6.1
set CLASSPATH=%WL_HOME%\sql
set CLASSPATH=%CLASSPATH%;%WL_HOME%\lib\jconn2.jar
set CLASSPATH=%CLASSPATH%;%WL_HOME%\lib\classes12.zip
set CLASSPATH=%CLASSPATH%;%WL_HOME%\lib\weblogic.jar
set CLASSPATH=%CLASSPATH%;%WL_HOME%\lib\ias.jar
set CLASSPATH=%CLASSPATH%;%WL_HOME%\lib\ias_securityadapter.jar
set CLASSPATH=%CLASSPATH%;%WL_HOME%\infalicense
REM Change the DB information here and also
REM the -Dias_scheduler and -Dadmin users to values of your choice
REM *************************************************************
%JAVA_HOME%\bin\java -Ddriver=com.informatica.jdbc.sqlserver.SQLServerDriver -Durl=jdbc:informatica:sqlserver://host_name:port;SelectMethod=cursor;DatabaseName=database_name -Duser=userName -Dpassword=userPassword -Dias_scheduler=pa_scheduler -Dadmin=paadmin repositoryutil.refresh.InfChangeSystemUserNames
REM END OF BATCH FILE
7. Make changes in the batch file as directed in the remarks [REM lines]
8. Save the file and open up a command prompt window and navigate to
D:\Temp\Repository Utils\Refresh\
9. At the prompt, type change_sys_user.bat and press Enter.
The user "ias_scheduler" and "admin" will be changed to "pa_scheduler" and
"paadmin", respectively.
10. Modify web.xml, and weblogic.xml (located at
.\bea\wlserver6.1\config\informatica\applications\ias\WEB-INF) by replacing
ias_scheduler with 'pa_scheduler'
11. Replace ias_scheduler with pa_scheduler in the xml file weblogic-ejb-jar.xml
This file is in iasEjb.jar file located in the directory
.\bea\wlserver6.1\config\informatica\applications\

To edit the file:
Make a copy of the iasEjb.jar
a. mkdir \tmp
b. cd \tmp
c. jar xvf \bea\wlserver6.1\config\informatica\applications\iasEjb.jar META-INF
d. cd META-INF
e. Update META-INF/weblogic-ejb-jar.xml, replacing ias_scheduler with pa_scheduler
f. cd \
g. jar uvf \bea\wlserver6.1\config\informatica\applications\iasEjb.jar -C \tmp .
NOTE: There is a trailing period at the end of the command above.
12. Restart the server.

TIP
To use the PowerAnalyzer user login name as a filter for a report:
PowerAnalyzer Version 4.0 provides a System Variable USER_LOGIN that
functions as a Global Variable for the username and is used whenever the
user submits SQL to the RDBMS.
1. Build a cross-reference table in the database that includes USER_ID and
specifies those tables that the user can access
2. Use the System Variable USER_LOGIN that sources from that cross-reference
table; it can be referenced either a) in the filters area (by overriding the SQL
and inserting $USER_ID$) or b) in the WHERE clause box under "Dimension
tables" (again specifying $USER_ID$).
3. The system parses $USER_ID$ when it sees it in the report SQL and
populates it with the value from the USER_LOGIN System Variable.



Tuning and Configuring PowerAnalyzer and
PowerAnalyzer Reports
Challenge
A PowerAnalyzer report that is slow to return data means lag time for a manager or
business analyst, and it can be a crucial point of failure in the acceptance of a data
warehouse. This Best Practice offers some suggestions for tuning PowerAnalyzer and
PowerAnalyzer reports.
Description
Performance tuning of reports occurs at both the environment level and the report
level. Report performance can often be enhanced by looking closely at the objective of
the report rather than its suggested appearance. The following guidelines should help
with tuning the environment and the report itself.
1. Perform Benchmarking. Benchmark the reports to determine an expected rate of
return. Perform benchmarks at various points throughout the day and evening
hours to account for inconsistencies in network traffic. This will provide a
baseline to measure changes against.
2. Review the Report. Confirm that all data elements are required in the report. Eliminate
any unnecessary data elements, filters and calculations. Also be sure to remove
any extraneous charts or graphs. Consider if the report can be broken into
multiple reports or presented at a higher level. These are often ways to create
more visually appealing reports and allow for linked detail reports or drill down
to detail level.
3. Scheduling of Reports. If the report is on-demand but can be changed to a
scheduled report, schedule the report to run during hours when the system use
is minimized. Consider scheduling large numbers of reports to run overnight. If
mid-day updates are required, test the performance at lunch hours and consider
scheduling for that time period. Reports that require filters by users can often be
copied and filters pre-created to allow for scheduling of the report.
4. Evaluate Database. Database tuning occurs on multiple levels. Begin by
reviewing the tables used in the report. Ensure that indexes have been created
on dimension keys. If filters are used on attributes, test the creation of
secondary indices to improve the efficiency of the query. Next, execute reports
while a DBA monitors the database environment. This will provide the DBA the
opportunity to tune the database for querying. Finally, look into changes in
database settings. Increasing the database memory in the initialization file often
improves PowerAnalyzer performance significantly.
5. Investigate Network. Reports are simply database queries, which can be found
by clicking the "View SQL" button on the report. Run the query from the report
against the database, using a client tool on the server where the database resides.
One caveat is that even the database tool on the server may contact the outside
network; work with the DBA during this test to use a local database connection
(e.g., Bequeath/IPC, Oracle's local database communication protocol) and monitor
the database throughout the process. This test will pinpoint whether the bottleneck
is occurring on the network or in the database. If, for instance, the query performs
similarly regardless of where it is executed, but the
report continues to be slow, this indicates a web server bottleneck. Common
locations for network bottlenecks include router tables, web server demand, and
server input/output. Informatica does recommend installing PowerAnalyzer on a
dedicated web server.
6. Tune the Schema. Having tuned the environment and minimized the report
requirements, the final level of tuning involves changes to the database tables.
Review the underperforming reports.
Can any of these be generated off of aggregate tables instead of base tables?
PowerAnalyzer makes efficient use of linked aggregate tables by determining on
a report-by-report basis if the report can utilize an aggregate table. By studying
the existing reports and future requirements, you can determine what key
aggregates can be created in the ETL tool and stored in the database.
Calculated metrics can also be created in an ETL tool and stored in the database
instead of created in PowerAnalyzer. Each time a calculation must be done in
PowerAnalyzer, it is being performed as part of the query process. To determine
if a query can be improved by building these elements in the database, try
removing them from the report and comparing report performance. Consider if
these elements are appearing in a multitude of reports or simply a few.
7. Database Queries. As a last resort for under-performing reports, you may want
to edit the actual report query. To determine if the query is the bottleneck,
select the View SQL button on the report. Next, copy the SQL into a query
utility and execute it. (DBA assistance may be beneficial here.) If the query
appears to be the bottleneck, revisit Steps 2 and 6 above to ensure that no
additional report changes are possible. Once you have confirmed that the report
is as required, work to edit the query while continuing to re-test it in a query
utility. Additional options include utilizing database views to cache data prior to
report generation. Reports are then built based on the view.
WARNING: editing the report query requires query editing for each report change and
may require editing during migrations. Be aware that this is a time-consuming process
and a difficult-to-maintain method of performance tuning.
The PowerAnalyzer repository database should be tuned for an OLTP workload.

Tuning JVM



JVM Layout
The JVM heap is the repository for all live objects, dead objects, and free memory. The
JVM has the following primary jobs:
Execute code
Manage memory
Remove garbage objects
The size of the JVM heap determines how often and how long garbage collection runs.
For WebLogic, JVM parameters can be set in "startWebLogic.cmd" or "startWebLogic.sh".
Parameters of JVM
1. -Xms and -Xmx parameters define the minimum and maximum heap size; for
large applications, the values should be set equal to each other.
2. Start with -Xms512m and -Xmx512m; as needed, increase the heap in 128m or
256m increments to reduce garbage collection.
3. The permanent generation holds the JVM's class and method objects; the
-XX:MaxPermSize command-line parameter controls the permanent generation's size.
4. The -XX:NewSize and -XX:MaxNewSize parameters control the new generation's
minimum and maximum size.
5. -XX:NewRatio=5 divides the heap old-to-new in a 5:1 ratio (i.e., the old generation
occupies 5/6 of the heap while the new generation occupies 1/6 of the heap).
o When the new generation fills up, it triggers a minor collection, in which
surviving objects are moved to the old generation.
o When the old generation fills up, it triggers a major collection, which
involves the entire object heap. This is more expensive in terms of
resources than a minor collection.
6. If you increase the new generation size, the old generation size decreases. Minor
collections occur less often, but major collections occur more frequently.
7. If you decrease the new generation size, the old generation size increases. Minor
collections occur more often, but major collections occur less frequently.
8. As a general rule, keep the new generation smaller than half the heap size (i.e.,
1/4 or 1/3 of the heap size).
9. Enable additional JVMs if you expect a large number of users. Informatica typically
recommends two to three CPUs per JVM. (A combined example of these settings
follows this list.)
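As an illustration, the flags above might be combined in the WebLogic startup script as
follows. The values shown are starting points for iterative testing, not recommendations
for any specific environment:
# Equal minimum and maximum heap, permanent generation cap, and a 5:1 old-to-new ratio
MEM_ARGS="-Xms512m -Xmx512m -XX:MaxPermSize=128m -XX:NewRatio=5"
# Alternatively, size the new generation explicitly instead of using NewRatio
# MEM_ARGS="-Xms512m -Xmx512m -XX:MaxPermSize=128m -XX:NewSize=128m -XX:MaxNewSize=128m"
export MEM_ARGS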

Other areas to tune
Execute Threads
Execute threads are the threads available to process simultaneous operations in WebLogic.
Too few threads means CPUs are under-utilized and jobs wait for threads to become
available.
Too many threads means the system wastes resources managing threads, and the OS
performs unnecessary context switches.
The default is 15 threads. Informatica recommends using the default value, but you
may need to experiment to determine the optimal value for your environment.
Connection Pooling
The application borrows a connection from the pool, uses it, and then returns it to the
pool by closing it.
Initial capacity = 15
Maximum capacity = 15
The sum of connections across all pools should equal the number of execute threads.
Setting the initial and maximum pool size to the same value avoids the overhead of
growing and shrinking the pool dynamically.
Performance packs use platform-optimized (i.e., native) sockets to improve server
performance. They are available on Windows NT/2000 (installed by default), Solaris
2.6/2.7, AIX 4.3, HP/UX, and Linux.
Check Enable Native I/O on the server attribute tab.
This adds NativeIOEnabled as true to config.xml.
For WebSphere, use the Performance Tuner to modify the configurable parameters.
For an optimal configuration, place the application server, the data warehouse, and the
repository on separate dedicated machines.
Application Server-Specific Tuning Details
JBoss Application Server
Web Container. Tune the web container by modifying the following configuration file
so that it accepts a reasonable number of HTTP requests as required by the
PowerAnalyzer installation. Ensure that the web container has an optimal number of
threads available so that it can accept and process more HTTP requests.
<JBOSS_HOME>/server/informatica/deploy/jbossweb-tomcat.sar/META-INF/jboss-
service.xml
The following is a typical configuration:
<!-- A HTTP/1.1 Connector on port 8080 -->
<Connector className="org.apache.coyote.tomcat4.CoyoteConnector" port="8080"
minProcessors="10" maxProcessors="100" enableLookups="true" acceptCount="20"
debug="0" tcpNoDelay="true" bufferSize="2048" connectionLinger="-1"
connectionTimeout="20000" />
The following parameters may need tuning:
minProcessors. Number of threads created initially in the pool.
maxProcessors. Maximum number of threads that can ever be created in the
pool.
acceptCount. Controls the length of the queue of waiting requests when no more
threads are available from the pool to process the request.
connectionTimeout. Amount of time to wait before a URI is received from the
stream. The default is 20 seconds. This avoids problems where a client opens a
connection and does not send any data.
tcpNoDelay. Set to true when data should be sent to the client without waiting for
the buffer to be full. This reduces latency at the cost of more packets being sent
over the network. The default is true.
enableLookups. Whether to perform a reverse DNS lookup to prevent spoofing.
Spoofing can cause problems when a DNS is misbehaving. The enableLookups
parameter can be turned off when you implicitly trust all clients.
connectionLinger. How long connections should linger after they are closed.
Informatica recommends using the default value: -1 (no linger).
In the PowerAnalyzer application, each web page can potentially have more than one
request to the application server. Hence, the maxProcessors should always be more
than the actual number of concurrent users. For an installation with 20 concurrent
users, a minProcessors of 5 and a maxProcessors of 100 are suitable values.
If the number of threads is too low, the following message may appear in the log files:
ERROR [ThreadPool] All threads are busy, waiting. Please increase maxThreads
JSP Optimization. To avoid having the application server compile JSP scripts when
they are executed for the first time, Informatica ships PowerAnalyzer with pre-compiled
JSPs.
The JSP servlet configuration is in the following file:
<JBOSS_HOME>/server/informatica/deploy/jbossweb-tomcat.sar/web.xml
The following is a typical configuration:

<servlet>
<servlet-name>jsp</servlet-name>
<servlet-class>org.apache.jasper.servlet.JspServlet</servlet-class>
<init-param>
<param-name>logVerbosityLevel</param-name>
<param-value>WARNING</param-value>
</init-param>
<init-param>
<param-name>development</param-name>
<param-value>false</param-value>
</init-param>
<load-on-startup>3</load-on-startup>
</servlet>
The following parameter may need tuning:
Set the development parameter to false in a production installation.
Database Connection Pool. PowerAnalyzer accesses the repository database to
retrieve metadata information. When it runs reports, it accesses the data sources to get
the report information. PowerAnalyzer keeps a pool of database connections for the
repository. It also keeps a separate database connection pool for each data source. To
optimize PowerAnalyzer database connections, you can tune the database connection
pools.
Repository Database Connection Pool. To optimize the repository database
connection pool, modify the JBoss configuration file:
<JBOSS_HOME>/server/informatica/deploy/<DB_Type>_ds.xml
The name of the file includes the database type. <DB_Type> can be Oracle, DB2, or
other databases. For example, for an Oracle repository, the configuration file name is
oracle_ds.xml.
The following is a typical configuration:
<datasources>
<local-tx-datasource>
<jndi-name>jdbc/IASDataSource</jndi-name>
<connection-url> jdbc:informatica:oracle://aries:1521;SID=prfbase8</connection-
url>
<driver-class>com.informatica.jdbc.oracle.OracleDriver</driver-class>
<user-name>powera</user-name>
<password>powera</password>
<exception-sorter-class-
name>org.jboss.resource.adapter.jdbc.vendor.OracleExceptionSorter
</exception-sorter-class-name>
<min-pool-size>5</min-pool-size>
<max-pool-size>50</max-pool-size>
<blocking-timeout-millis>5000</blocking-timeout-millis>
<idle-timeout-minutes>1500</idle-timeout-minutes>
</local-tx-datasource>
</datasources>

The following parameters may need tuning:
min-pool-size. The minimum number of connections in the pool. (The pool is
lazily constructed, i.e. it will be empty until it is first accessed. Once used, it will
always have at least the min-pool-size connections.)
max-pool-size. The strict maximum size of the connection pool.
blocking-timeout-millis. The maximum time in milliseconds that a caller waits to
get a connection when no more free connections are available in the pool.
idle-timeout-minutes. The maximum length of time an idle connection remains in
the pool before it is closed.
The max-pool-size value should be at least five more than the maximum number of
concurrent users, because there may be several scheduled reports running in the
background and each of them needs a database connection.
A higher value is recommended for idle-timeout-minutes. Since PowerAnalyzer accesses
the repository very frequently, it is inefficient to spend resources on checking for idle
connections and cleaning them out. Checking for idle connections may block other
threads that require new connections.
Data Source Database Connection Pool. Similar to the repository database
connection pools, the data source also has a pool of connections that PowerAnalyzer
dynamically creates as soon as the first client requests a connection.
The tuning parameters for these dynamic pools are in the following file:
<JBOSS_HOME>/bin/IAS.properties
The following is a typical configuration:
#
# Datasource definition
#
dynapool.initialCapacity=5
dynapool.maxCapacity=50
dynapool.capacityIncrement=2
dynapool.allowShrinking=true
dynapool.shrinkPeriodMins=20
dynapool.waitForConnection=true
dynapool.waitSec=1
dynapool.poolNamePrefix=IAS_
dynapool.refreshTestMinutes=60
datamart.defaultRowPrefetch=20
The following JBoss-specific parameters may need tuning:
dynapool.initialCapacity. The minimum number of initial connections in the data
source pool.
dynapool.maxCapacity. The maximum number of connections that the data
source pool may grow to.
dynapool.poolNamePrefix. This parameter is a prefix added to the dynamic JDBC
pool name for identification purposes.

dynapool.waitSec. The maximum amount of time (in seconds) a client will wait
to grab a connection from the pool if none is readily available.
dynapool.refreshTestMinutes. This parameter determines the frequency at
which a health check is performed on the idle connections in the pool. This
should not be performed too frequently because it locks up the connection pool
and may prevent other clients from grabbing connections from the pool.
dynapool.shrinkPeriodMins. This parameter determines the amount of time (in
minutes) an idle connection is allowed to be in the pool. After this period, the
number of connections in the pool shrinks back to the value of its initialCapacity
parameter. This is done only if the allowShrinking parameter is set to true.
EJB Container
PowerAnalyzer uses EJBs extensively. It has more than 50 stateless session beans
(SLSB) and more than 60 entity beans (EB). In addition, there are six message-driven
beans (MDBs) that are used for the scheduling and real-time functionalities.
Stateless Session Beans (SLSB). For SLSBs, the most important tuning parameter is
the EJB pool. You can tune the EJB pool parameters in the following file:
<JBOSS_HOME>/server/Informatica/conf/standardjboss.xml.
The following is a typical configuration:
<container-configuration>
<container-name> Standard Stateless SessionBean</container-name>
<call-logging>false</call-logging>
<invoker-proxy-binding-name>
stateless-rmi-invoker</invoker-proxy-binding-name>
<container-interceptors>
<interceptor>org.jboss.ejb.plugins.ProxyFactoryFinderInterceptor
</interceptor>
<interceptor> org.jboss.ejb.plugins.LogInterceptor</interceptor>
<interceptor>
org.jboss.ejb.plugins.SecurityInterceptor</interceptor>
<!-- CMT -->
<interceptor transaction="Container">
org.jboss.ejb.plugins.TxInterceptorCMT</interceptor>
<interceptor transaction="Container" metricsEnabled="true">
org.jboss.ejb.plugins.MetricsInterceptor</interceptor>
<interceptor transaction="Container">
org.jboss.ejb.plugins.StatelessSessionInstanceInterceptor
</interceptor>
<!-- BMT -->
<interceptor transaction="Bean">
org.jboss.ejb.plugins.StatelessSessionInstanceInterceptor
</interceptor>
<interceptor transaction="Bean">
org.jboss.ejb.plugins.TxInterceptorBMT</interceptor>
<interceptor transaction="Bean" metricsEnabled="true">
org.jboss.ejb.plugins.MetricsInterceptor</interceptor>
<interceptor>

org.jboss.resource.connectionmanager.CachedConnectionInterceptor
</interceptor>
</container-interceptors>
<instance-pool>
org.jboss.ejb.plugins.StatelessSessionInstancePool</instance-pool>
<instance-cache></instance-cache>
<persistence-manager></persistence-manager>
<container-pool-conf>
<MaximumSize>100</MaximumSize>
</container-pool-conf>
</container-configuration>
The following parameter may need tuning:
MaximumSize. Represents the maximum number of objects in the pool. If
<strictMaximumSize> is set to true, then <MaximumSize> is a strict upper limit
for the number of objects that will be created. If <strictMaximumSize> is set to
false, the number of active objects can exceed the <MaximumSize> if there are
requests for more objects. However, only the <MaximumSize> number of
objects will be returned to the pool.
Additionally, there are two other parameters that you can set to fine tune the EJB
pool. These two parameters are not set by default in PowerAnalyzer. They can
be tuned after you have done proper iterative testing in PowerAnalyzer to
increase the throughput for high-concurrency installations.
strictMaximumSize. When the value is set to true, the <strictMaximumSize>
enforces a rule that only <MaximumSize> number of objects will be active. Any
subsequent requests will wait for an object to be returned to the pool.
strictTimeout. If you set <strictMaximumSize> to true, then <strictTimeout> is
the amount of time that requests will wait for an object to be made available in
the pool.
Message-Driven Beans (MDB). MDB tuning parameters are very similar to stateless
bean tuning parameters. The main difference is that MDBs are not invoked by clients.
Instead, the messaging system delivers messages to the MDB when they are available.
To tune the MDB parameters, modify the following configuration file:
<JBOSS_HOME>/server/informatica/conf/standardjboss.xml
The following is a typical configuration:
<container-configuration>
<container-name>Standard Message Driven Bean</container-name>
<call-logging>false</call-logging>
<invoker-proxy-binding-name>message-driven-bean
</invoker-proxy-binding-name>
<container-interceptors>
<interceptor>org.jboss.ejb.plugins.ProxyFactoryFinderInterceptor
</interceptor>
<interceptor>org.jboss.ejb.plugins.LogInterceptor</interceptor>
<interceptor>org.jboss.ejb.plugins.RunAsSecurityInterceptor
</interceptor>

<!-- CMT -->
<interceptor transaction="Container">
org.jboss.ejb.plugins.TxInterceptorCMT</interceptor>
<interceptor transaction="Container" metricsEnabled="true">
org.jboss.ejb.plugins.MetricsInterceptor
</interceptor>
<interceptor transaction="Container">
org.jboss.ejb.plugins.MessageDrivenInstanceInterceptor
</interceptor>
<!-- BMT -->
<interceptor transaction="Bean">
org.jboss.ejb.plugins.MessageDrivenInstanceInterceptor
</interceptor>
<interceptor transaction="Bean">
org.jboss.ejb.plugins.MessageDrivenTxInterceptorBMT
</interceptor>
<interceptor transaction="Bean" metricsEnabled="true">
org.jboss.ejb.plugins.MetricsInterceptor</interceptor>
<interceptor>
org.jboss.resource.connectionmanager.CachedConnectionInterceptor
</interceptor>
</container-interceptors>
<instance-pool>org.jboss.ejb.plugins.MessageDrivenInstancePool
</instance-pool>
<instance-cache></instance-cache>
<persistence-manager></persistence-manager>
<container-pool-conf>
<MaximumSize>100</MaximumSize>
</container-pool-conf>
</container-configuration>
The following parameter may need tuning:
MaximumSize. Represents the maximum number of objects in the pool. If
<strictMaximumSize> is set to true, then <MaximumSize> is a strict upper limit for the
number of objects that will be created. Otherwise, if <strictMaximumSize> is set to
false, the number of active objects can exceed the <MaximumSize> if there are request
for more objects. However, only the <MaximumSize> number of objects will be
returned to the pool.
Additionally, there are two other parameters that you can set to fine tune the EJB pool.
These two parameters are not set by default in PowerAnalyzer. They can be tuned after
you have done proper iterative testing in PowerAnalyzer to increase the throughput for
high-concurrency installations.
strictMaximumSize. When the value is set to true, the <strictMaximumSize>
parameter enforces a rule that only <MaximumSize> number of objects will be
active. Any subsequent requests will wait for an object to be returned to the
pool.
strictTimeout. If you set <strictMaximumSize> to true, then <strictTimeout> is
the amount of time that requests will wait for an object to be made available in
the pool.

Enterprise Java Beans (EJB). PowerAnalyzer EJBs use BMP (bean-managed
persistence) as opposed to CMP (container-managed persistence). The EJB tuning
parameters are very similar to the stateless bean tuning parameters.
The EJB tuning parameters are in the following configuration file:
<JBOSS_HOME>/server/informatica/conf/standardjboss.xml.
The following is a typical configuration:
<container-configuration>
<container-name>Standard BMP EntityBean</container-name>
<call-logging>false</call-logging>
<invoker-proxy-binding-name>entity-rmi-invoker
</invoker-proxy-binding-name>
<sync-on-commit-only>false</sync-on-commit-only>
<container-interceptors>
<interceptor>org.jboss.ejb.plugins.ProxyFactoryFinderInterceptor
</interceptor>
<interceptor>org.jboss.ejb.plugins.LogInterceptor</interceptor>
<interceptor>org.jboss.ejb.plugins.SecurityInterceptor
</interceptor>
<interceptor>org.jboss.ejb.plugins.TxInterceptorCMT
</interceptor>
<interceptor metricsEnabled="true">
org.jboss.ejb.plugins.MetricsInterceptor</interceptor>
<interceptor>org.jboss.ejb.plugins.EntityCreationInterceptor
</interceptor>
<interceptor>org.jboss.ejb.plugins.EntityLockInterceptor
</interceptor>
<interceptor>org.jboss.ejb.plugins.EntityInstanceInterceptor
</interceptor>
<interceptor>org.jboss.ejb.plugins.EntityReentranceInterceptor
</interceptor>
<interceptor>
org.jboss.resource.connectionmanager.CachedConnectionInterceptor
</interceptor>
<interceptor>
org.jboss.ejb.plugins.EntitySynchronizationInterceptor
</interceptor>
</container-interceptors>
<instance-pool>org.jboss.ejb.plugins.EntityInstancePool
</instance-pool>
<instance-cache>org.jboss.ejb.plugins.EntityInstanceCache
</instance-cache>
<persistence-manager>org.jboss.ejb.plugins.BMPPersistenceManager
</persistence-manager>
<locking-policy>org.jboss.ejb.plugins.lock.QueuedPessimisticEJBLock
</locking-policy>
<container-cache-conf>
<cache-policy>org.jboss.ejb.plugins.LRUEnterpriseContextCachePolicy
</cache-policy>

<cache-policy-conf>
<min-capacity>50</min-capacity>
<max-capacity>1000000</max-capacity>
<overager-period>300</overager-period>
<max-bean-age>600</max-bean-age>
<resizer-period>400</resizer-period>
<max-cache-miss-period>60</max-cache-miss-period>
<min-cache-miss-period>1</min-cache-miss-period>
<cache-load-factor>0.75</cache-load-factor>
</cache-policy-conf>
</container-cache-conf>
<container-pool-conf>
<MaximumSize>100</MaximumSize>
</container-pool-conf>
<commit-option>A</commit-option>
</container-configuration>
The following parameter may need tuning:
MaximumSize. Represents the maximum number of objects in the pool. If
<strictMaximumSize> is set to true, then <MaximumSize> is a strict upper limit for the
number of objects that will be created. Otherwise, if <strictMaximumSize> is set to
false, the number of active objects can exceed the <MaximumSize> if there are request
for more objects. However, only the <MaximumSize> number of objects will be
returned to the pool.
Additionally, there are two other parameters that you can set to fine tune the EJB pool.
These two parameters are not set by default in PowerAnalyzer. They can be tuned after
you have done proper iterative testing in PowerAnalyzer to increase the throughput for
high-concurrency installations.
strictMaximumSize. When the value is set to true, the <strictMaximumSize>
parameter enforces a rule that only <MaximumSize> number of objects will be
active. Any subsequent requests will wait for an object to be returned to the
pool.
strictTimeout. If you set <strictMaximumSize> to true, then <strictTimeout> is
the amount of time that requests will wait for an object to be made available in
the pool.
RMI Pool
The JBoss Application Server can be configured to have a pool of threads to accept
connections from clients for remote method invocation (RMI). If you use the Java RMI
protocol to access the PowerAnalyzer API from other custom applications, you can
optimize the RMI thread pool parameters.
To optimize the RMI pool, modify the following configuration file:
<JBOSS_HOME>/server/informatica/conf/jboss-service.xml
The following is a typical configuration:

<mbean code="org.jboss.invocation.pooled.server.PooledInvoker" name="jboss:service=invoker,type=pooled">
<attribute name="NumAcceptThreads">1</attribute>
<attribute name="MaxPoolSize">300</attribute>
<attribute name="ClientMaxPoolSize">300</attribute>
<attribute name="SocketTimeout">60000</attribute>
<attribute name="ServerBindAddress"></attribute>
<attribute name="ServerBindPort">0</attribute>
<attribute name="ClientConnectAddress"></attribute>
<attribute name="ClientConnectPort">0</attribute>
<attribute name="EnableTcpNoDelay">false</attribute>
<depends optional-attribute-name="TransactionManagerService">
jboss:service=TransactionManager
</depends>
</mbean>
The following parameters may need tuning:
NumAcceptThreads. The number of threads used to accept connections from
clients.
MaxPoolSize. A strict maximum size for the pool of threads to service requests on
the server.
ClientMaxPoolSize. A strict maximum size for the pool of threads to service
requests on the client.
Backlog. The number of requests in the queue when all the processing threads
are in use.
EnableTcpNoDelay. Indicates whether information should be sent before the buffer
is full. Setting it to true may increase the network traffic because more packets
will be sent across the network.
WebSphere Application Server 5.1
The Tivoli Performance Viewer can be used to observe the behavior of some of these parameters and arrive at good settings.
Web Container
Navigate to Application Servers > [your_server_instance] > Web Container > Thread
Pool to tune the following parameters.
Minimum Size: Specifies the minimum number of threads to allow in the pool. The
default value of 10 is appropriate.
Maximum Size: Specifies the maximum number of threads to allow in the pool. For
a highly concurrent usage scenario (with a 3-VM load-balanced configuration),
a value of 50-60 has been determined to be optimal.
Thread Inactivity Timeout: Specifies the number of milliseconds of inactivity that
should elapse before a thread is reclaimed. The default of 3500ms is considered
optimal.
Is Growable: Specifies whether the number of threads can increase beyond the
maximum size configured for the thread pool. Be sure to leave this option
unchecked. Also the maximum threads should be hard-limited to the value given
in the Maximum Size.

Note: In a load-balanced environment, there will be more than one server instance
that may possibly be spread across multiple machines. In such a scenario, be sure that
the changes have been properly propagated to all the server instances.
Transaction Services
Total transaction lifetime timeout: In certain circumstances (e.g., import of large
XML files), the default value of 120 seconds may not be sufficient and should be
increased. This parameter can also be modified at runtime.

Diagnostic Trace Services
Disable the trace in a production environment.
Navigate to Application Servers > [your_server_instance] > Administration
Services > Diagnostic Trace Service and make sure Enable Tracing is not
checked.
Debugging Services
Ensure that tracing is disabled in a production environment.
Navigate to Application Servers > [your_server_instance] > Logging and Tracing >
Diagnostic Trace Service > Debugging Service and make sure Startup is not
checked.
Performance Monitoring Services
This set of parameters is for monitoring the health of the Application Server. This
monitoring service tries to ping the application server after a certain interval; if the
server is found to be dead, then it tries to restart the server.
Navigate to Application Servers > [your_server_instance] > Process Definition >
MonitoringPolicy and tune the parameters according to a policy determined for each
PowerAnalyzer installation.
Note: The Ping Timeout parameter determines how long the monitoring service waits
for a response before it considers the server faulty. The monitoring service then
attempts to kill the server and restart it if Automatic restart is checked. Take
care that Ping Timeout is not set to too small a value.
Process Definitions (JVM Parameters)
For PowerAnalyzer with a high number of concurrent users, Informatica recommends that
the minimum and the maximum heap size be set to the same values. This avoids the
heap allocation-reallocation expense during a high-concurrency scenario. Also, for a
high-concurrency scenario, Informatica recommends setting the values of minimum
heap and maximum heap size to at least 1000MB. Further tuning of this heap-size is
recommended after carefully studying the garbage collection behavior by turning on the
verbosegc option.

The following is a list of java parameters (for IBM JVM 1.4.1) that should NOT be
modified from the default values for PowerAnalyzer installation:
-Xnocompactgc: This parameter switches off heap compaction altogether.
Switching off heap compaction results in heap fragmentation. Since
PowerAnalyzer frequently allocates large objects, heap fragmentation can result
in OutOfMemory exceptions.
-Xcompactgc: Using this parameter leads to each garbage collection cycle
carrying out compaction, regardless of whether it's useful.
-Xgcthreads: This controls the number of garbage collection helper threads
created by the JVM during startup. The default is N-1 threads for an N-processor
machine. These threads provide the parallelism in parallel mark and parallel
sweep modes, which reduces the pause time during garbage collection.
-Xclassnogc: This disables collection of class objects.
-Xinitsh: This sets the initial size of the application-class system heap. The system
heap is expanded as needed and is never garbage collected.
You may want to alter the following parameters after carefully examining the
application server processes:
Navigate to Application Servers > [your_server_instance] > Process Definition >
Java Virtual Machine.
Verbose garbage collection: This option can be checked to turn on verbose
garbage collection. This can help in understanding the behavior of the garbage
collection for the application. It has a very low overhead on performance and
can be turned on even in the production environment.
Initial heap size: This is the -ms value. Only the numeric value (without MB)
needs to be specified. For concurrent usage, start the initial heap size at 1000
and, depending on the garbage collection behavior, potentially increase it up to
2000. A value beyond 2000 may actually reduce throughput because the garbage
collection cycles take more time to go through the large heap, even though the
cycles may occur less frequently.
Maximum heap size: This is the -mx value. It should be equal to the Initial heap
size value.
RunHProf: This should remain unchecked in production mode, because it slows
down the VM considerably.
Debug Mode: This should remain unchecked in production mode, because it slows
down the VM considerably.
Disable JIT: This should remain unchecked (i.e., JIT should never be disabled).
Performance Monitoring Services
Be sure that performance monitoring services are not enabled in a production
environment.
Navigate to Application Servers > [your_server_instance] > Performance Monitoring
Services and be sure Startup is not checked.
Database Connection Pool

The repository database connection pool can be configured by navigating to JDBC
Providers > User-defined JDBC Provider > Data Sources > IASDataSource >
Connection Pools
The various parameters that may need tuning are:
Connection Timeout: The default value of 180 seconds is usually sufficient. After
180 seconds, a request to obtain a connection from the pool times out and
PowerAnalyzer throws an exception. If this occurs, the pool size may need to be
increased.
Max Connections: The maximum number of connections in the pool. Informatica
recommends a value of 50 for this.
Min Connections: The minimum number of connections in the pool. Informatica
recommends a value of 10 for this.
Reap Time: This specifies the interval between runs of the pool maintenance
thread. Maintenance should not run too frequently because while the maintenance
thread is running it blocks the whole pool and no process can obtain a new
connection from the pool. If the database and the network are reliable, set this
to a high value (e.g., 1000).
Unused Timeout: This specifies the time in seconds after which an unused
connection will be discarded until the pool size reaches the minimum size. In a
highly concurrent usage, this should be a high value. The default of 1800
seconds should be fine.
Aged Timeout: Specifies the interval in seconds before a physical connection is
discarded. If the database and the network are stable, there should not be a
reason for age timeout. The default is 0 (i.e., connections do not age). If the
database or the network connection to the repository database frequently comes
down (compared to the life of the AppServer), this may be used to age out the
stale connections.
Much like the repository database connection pools, the data source or data warehouse
databases also have a pool of connections that are created dynamically by
PowerAnalyzer as soon as the first client makes a request.
The tuning parameters for these dynamic pools are present in
<WebSphere_Home>/AppServer/IAS.properties file.
The following is a typical configuration:
#
# Datasource definition
#
dynapool.initialCapacity=5
dynapool.maxCapacity=50
dynapool.capacityIncrement=2

dynapool.allowShrinking=true
dynapool.shrinkPeriodMins=20
dynapool.waitForConnection=true
dynapool.waitSec=1
dynapool.poolNamePrefix=IAS_
dynapool.refreshTestMinutes=60
datamart.defaultRowPrefetch=20
The various parameters that may need tuning are:
dynapool.initialCapacity: the minimum number of initial connections in the data-
source pool.
dynapool.maxCapacity: the maximum number of connections that the data-source
pool may grow up to.
dynapool.poolNamePrefix: this is just a prefix added to the dynamic JDBC pool
name for identification purposes.
dynapool.waitSec: the maximum amount of time (in seconds) that a client will wait
to grab a connection from the pool if none is readily available.
dynapool.refreshTestMinutes: this determines the frequency at which a health
check on the idle connections in the pool is performed. Such checks should not
be performed too frequently because they lock up the connection pool and may
prevent other clients from grabbing connections from the pool.
dynapool.shrinkPeriodMins: this determines the amount of time (in minutes) an
idle connection is allowed to be in the pool. After this period, the number of
connections in the pool decreases (to its initialCapacity). This is done only if
allowShrinking is set to true.

Message Listeners Services
To process scheduled reports, PowerAnalyzer uses Message-Driven-Beans. It is possible
to run multiple reports within one schedule in parallel by increasing the number of
instances of the MDB catering to the Scheduler (InfScheduleMDB). Take care, however,
not to increase this to an arbitrarily high value: each report consumes
considerable resources (e.g., database connections and CPU at both the
application-server and database-server levels), and setting it too high may
actually be detrimental to the whole system.
Navigate to Application Servers > [your_server_instance] > Message Listener Service
> Listener Ports > IAS_ScheduleMDB_ListenerPort.
The parameters that can be tuned are:
Maximum sessions: The default value is 1. In a highly concurrent user scenario,
Informatica does not recommend going beyond 5.

Maximum messages: This should remain as 1. This implies that each report in a
schedule will be executed in a separate transaction instead of a batch. Setting it
to more than 1 may have unwanted effects like transaction timeouts, and the
failure of one report may cause all the reports in the batch to fail.

Plug-in Retry Intervals and Connect Timeouts
When PowerAnalyzer is set up in a clustered WebSphere environment, a plug-in is
normally used to perform the load-balancing between each server in the cluster. The
proxy http-server sends the request to the plug-in and the plug-in then routes the
request to the proper application-server.
The plug-in file can be generated automatically by navigating to
Environment > Update web server plugin configuration.
The default plug-in file contains ConnectTimeOut=0, which means that it relies on the
TCP timeout setting of the server. It is possible to have different timeout settings for
different servers in the cluster. The timeout setting implies that if the server does
not respond within the given number of seconds, it is marked as down and the
request is sent to the next available member of the cluster.
The RetryInterval parameter allows you to specify how long to wait before retrying a
server that is marked as down. The default value is 10 seconds. This means if a cluster
member is marked as down, the server will not try to send a request to the same
member for 10 seconds.


Upgrading PowerAnalyzer
Challenge
Seamlessly upgrade PowerAnalyzer from one release to another while safeguarding the
repository. This Best Practice describes the upgrade process from version 4.1.1 to
version 5.0, but the same general steps apply to any PowerAnalyzer upgrade.
Description
Upgrading PowerAnalyzer involves two steps:
1. Upgrading the PowerAnalyzer application.
2. Upgrading the PowerAnalyzer repository.
Steps Before The Upgrade
1. Back up the repository. To ensure a clean backup, shut down PowerAnalyzer and
create the backup, following the steps in the PowerAnalyzer manual.
2. Restore the backed up repository into an empty database or a new schema. This
will ensure that you have a hot backup of the repository if, for some reason, the
upgrade fails.
Steps for upgrading PowerAnalyzer application
The upgrade process varies from application server to application server on which
PowerAnalyzer is hosted.
For WebLogic:
1. Install WebLogic 8.1 without uninstalling the existing Application
Server (WebLogic 6.1).
2. Install the PowerAnalyzer application on the new WebLogic 8.1 Application
Server, making sure to use a different port than the one used in the old
installation. When prompted for a repository, choose the existing repository
option and give the connection details of the database that hosts
the backed-up PowerAnalyzer 4.1.1 repository.
3. When the installation is complete, use the Upgrade utility to connect to the
database that hosts the PowerAnalyzer 4.1.1 backed up repository and perform
the upgrade.

For JBoss and WebSphere:
1. Uninstall PowerAnalyzer 4.1.1.
2. Install PowerAnalyzer 5.0.
3. When prompted for a repository, choose the existing repository option and
give the connection details of the database that hosts the backed-up
PowerAnalyzer 4.1.1 repository.
4. Use the Upgrade utility to connect to the database that hosts the backed-up
PowerAnalyzer 4.1.1 repository and perform the upgrade.
When the repository upgrade is complete, start PowerAnalyzer 5.0 and perform a
simple acceptance test.
You can use the following test case (or a subset of it) as an acceptance test.
1. Open a simple report
2. Open a cached report.
3. Open a report with filtersets.
4. Open a sectional report.
5. Open a workflow and also its nodes.
6. Open a report and drill through it.
When all the reports open without problems, your upgrade is complete.
Once the upgrade of the backed-up repository is successful, repeat the above process on the actual repository.
Note: This upgrade process creates two instances of PowerAnalyzer. So when the
upgrade is successful, uninstall the older version, following the steps in the
PowerAnalyzer manual.


Advanced Client Configuration Options
Challenge
Setting the Registry to ensure consistent client installations, resolve potential missing
or invalid license key issues, and change the Server Manager Session Log Editor to your
preferred editor.
Description
Ensuring Consistent Data Source Names
To ensure the use of consistent data source names for the same data sources across
the domain, the Administrator can create a single "official" set of data sources, then use
the Repository Manager to export that connection information to a file. You can then
distribute this file and import the connection information for each client machine.
Solution:
From Repository Manager, choose Export Registry from the Tools drop down
menu.
For all subsequent client installs, simply choose Import Registry from the Tools
drop down menu.

Resolving the Missing or Invalid License Key Issue
The missing or invalid license key error occurs when attempting to install
PowerCenter Client tools on NT 4.0 or Windows 2000 with a userid other than
Administrator.
This problem also occurs when the client software tools are installed under the
Administrator account, and subsequently a user with a non-administrator ID attempts
to run the tools. The user who attempts to log in using the normal non-administrator
userid will be unable to start the PowerCenter Client tools. Instead, the software will
display the message indicating that the license key is missing or invalid.
Solution:

While logged in as the installation user with administrator authority, use regedt32
to edit the registry.
Under HKEY_LOCAL_MACHINE open Software/Informatica/PowerMart Client Tools/.
From the menu bar, select Security/Permissions, and grant read access to the
users that should be permitted to use the PowerMart Client. (Note that the
registry entries for both PowerMart and PowerCenter Server and client tools are
stored as PowerMart Server and PowerMart Client tools.)

Changing the Session Log Editor
In PowerCenter versions 6.0 to 7.1.2, the session and workflow log editor defaults to
WordPad within the Workflow Monitor client tool. To choose a different editor,
select Tools > Options in the Workflow Monitor. On the General tab, browse for the
editor that you want.
For PowerCenter versions earlier than 6.0, the editor does not default to WordPad
unless wordpad.exe can be found in the path statement. Instead, a window
appears the first time a session log is viewed from the PowerCenter Server Manager,
prompting the user to enter the full path name of the editor to be used to view the logs.
Users often set this parameter incorrectly and must access the registry to change it.
Solution:
While logged in as the installation user with administrator authority, use regedt32
to go into the registry.
Move to registry path location: HKEY_CURRENT_USER\Software\Informatica\PowerMart
Client Tools\[CLIENT VERSION]\Server Manager\Session Files. From the menu bar,
select View Tree and Data.
Select the Log File Editor entry by double clicking on it.
Replace the entry with the appropriate editor, typically WordPad.exe or Write.exe.
Select Registry --> Exit from the menu bar to save the entry.
For PowerCenter version 7.1 and above, you should set the log editor option in the
Workflow Monitor, as shown in Figure 1 below.
Figure 1: Workflow Monitor Options dialog box used for setting the editor for workflow
and session logs.


Customize to Add a New Command Under a Tools Menu
Other tools are often needed during development and testing in addition to the
PowerCenter client tools. For example, a tool to query the database, such as Enterprise
Manager (SQL Server) or TOAD (Oracle), is often needed. It is possible to add shortcuts
to executable programs from any client tool's Tools drop-down menu. This allows for
quick access to these programs.
Solution:
Just choose Customize under the Tools menu and then add a new item. Once it is
added, browse to find the executable it will call.

After this is done once, you can easily call another program from your PowerCenter
client tools.
In the following example, TOAD can be called quickly from the Repository Manager tool.

Target Load Type
In PowerCenter versions 6.0 and earlier, every time a session was created, it defaulted
to be of type bulk. This was not necessarily what was desired and the session might
fail under certain conditions if it was not changed. In version 7.0 and above, a
property can be set in the Workflow Manager to choose the default target load type,
bulk or normal.
Solution:
In the Workflow Manager, choose Tools > Options and go to the Miscellaneous tab.
Select Normal or Bulk as the default target load type, as desired.
Click OK, then close and reopen the Workflow Manager.

After this, every time a session is created, the target load type for all relational targets
will default to your choice.
Undocked Explorer Window
The Repository Navigator window sometimes becomes undocked. Docking it again can
be frustrating because double clicking on the window header does not put it back in
place.

Solution:
To dock it again, right-click in the white space of the Navigator window and
make sure that the Allow Docking option is checked. If it is checked, double-click
the title bar of the Navigator window.


Advanced Server Configuration Options
Challenge
Configuring the Throttle Reader and File Debugging options, adjusting semaphore
settings in the UNIX environment, and configuring server variables.
Description

Configuring the Throttle Reader
If problems occur when running sessions, some adjustments at the server level can
help to alleviate issues or isolate problems.
One technique that often helps resolve hanging sessions is to limit the number of
reader buffers by using the throttle reader setting. This is particularly effective if your
mapping contains many target tables, or if the session employs constraint-based loading.
This parameter closely manages buffer blocks in memory by restricting the number of
blocks that can be utilized by the reader.
Note for PowerCenter 5.x and above ONLY: If a session is hanging and it is
partitioned, it is best to remove the partitions before adjusting the throttle reader.
When a session is partitioned, the server makes separate connections to the source and
target for every partition. This can cause the server to manage many buffer blocks. If
the session still hangs, try adjusting the throttle reader.
Solution: To limit the number of reader buffers using throttle reader in NT/2000:
Open the registry key
hkey_local_machine\system\currentcontrolset\services\powermart\parameters\
miscinfo.
Create a new string value with value name of 'ThrottleReader' and value data of
'10'.
To do the same thing in UNIX:
Add this line to the pmserver.cfg file:
ThrottleReader=10

Configuring File Debugging Options
If problems occur when running sessions or if the PowerCenter Server has a stability
issue, help technical support to resolve the issue by supplying them with debug files.
To set the debug options on for NT/2000:
1. Select Start > Run and type regedit.
2. Go to hkey_local_machine, system, current_control_set, services, powermart,
miscInfo.
3. Select Edit, then Add Value.
4. Enter "DebugScrubber" as the value name, then click OK. Insert "4" as the value data.
5. Repeat steps 3 and 4 for "DebugWriter", "DebugReader", and "DebugDTM", setting
all three to "1".
To do the same in UNIX:
Insert the following entries in the pmserver.cfg file:
DebugScrubber=4
DebugWriter=1
DebugReader=1
DebugDTM=1

Adjusting Semaphore Settings
When the PowerCenter Server runs on a UNIX platform, it uses operating system
semaphores to keep processes synchronized and to prevent collisions when accessing
shared data structures. You may need to increase these semaphore settings before
installing the server.
The number of semaphores required to run a session is 7. Most installations require
between 64 and 128 available semaphores, depending on the number of sessions the
server runs concurrently. This is in addition to any semaphores required by other
software, such as database servers.
The total number of available operating system semaphores is an operating system
configuration parameter, with a limit per user and system. The method used to change
the parameter depends on the operating system:
HP/UX: Use sam (1M) to change the parameters.
Solaris: Use admintool or edit /etc/system to change the parameters.
AIX: Use smit to change the parameters.
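Before changing the kernel parameters, it can be useful to see how many semaphore sets are already in use. The following commands are a minimal sketch (the exact output format varies by platform):
ipcs -s | wc -l        # rough count of allocated semaphore sets (includes a few header lines)
sysdef | grep -i sem   # on Solaris, shows the configured semaphore kernel limits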

Setting Shared Memory and Semaphore Parameters
Informatica recommends setting the following parameters as high as possible for the
Solaris operating system. However, if you set these parameters too high, the machine
may not boot. Refer to the operating system documentation for parameter limits. Note
that different UNIX operating systems set these variables in different ways, or may be
self-tuning:
Recommended values for Solaris:
SHMMAX = 4294967295. Maximum size in bytes of a shared memory segment.
SHMMIN = 1. Minimum size in bytes of a shared memory segment.
SHMMNI = 100. Number of shared memory identifiers.
SHMSEG = 10. Maximum number of shared memory segments that can be attached by a process.
SEMMNS = 200. Number of semaphores in the system.
SEMMNI = 70. Number of semaphore set identifiers in the system. SEMMNI determines the number of semaphores that can be created at any one time.
SEMMSL = equal to or greater than the value of the PROCESSES initialization parameter. Maximum number of semaphores in one semaphore set. Must be equal to the maximum number of processes.
For example, you might add the following lines to the Solaris /etc/system file to
configure the UNIX kernel:
set shmsys:shminfo_shmmax = 4294967295
set shmsys:shminfo_shmmin = 1
set shmsys:shminfo_shmmni = 100
set shmsys:shminfo_shmseg = 10
set semsys:seminfo_semmns = 200
set semsys:seminfo_semmni = 70
Always reboot the system after configuring the UNIX kernel.
Configuring Server Variables
One configuration best practice is to properly configure and leverage server variables.
The benefits of using server variables include:
Ease of deployment from development environment to production environment.

Ease of switching sessions from one server machine to another without manually
editing all the sessions to change directory paths.
All the variables are related to directory paths used by server.
Approach
The Workflow Manager and pmrep can be used to edit the server configuration to set or
change the variables.
Each registered server has its own set of variables. The list is fixed, not user-extensible.
The server variables and their default values are:
$PMRootDir: (no default; user must insert a path)
$PMSessionLogDir: $PMRootDir/SessLogs
$PMBadFileDir: $PMRootDir/BadFiles
$PMCacheDir: $PMRootDir/Cache
$PMTargetFileDir: $PMRootDir/TargetFiles
$PMSourceFileDir: $PMRootDir/SourceFiles
$PMExtProcDir: $PMRootDir/ExtProc
$PMTempDir: $PMRootDir/Temp
$PMSuccessEmailUser: (no default; user must insert a value)
$PMFailureEmailUser: (no default; user must insert a value)
$PMSessionLogCount: 0
$PMSessionErrorThreshold: 0
$PMWorkflowLogDir: $PMRootDir/WorkflowLogs
$PMWorkflowLogCount: 0
$PMLookupFileDir: $PMRootDir/LkpFiles
You can define server variables for each PowerCenter Server you register. Some server
variables define the path and directories for workflow output files and caches. By
default, the PowerCenter Server places output files in these directories when you run a
workflow. Other server variables define session/workflow attributes such as log file
count, email user, and error threshold.
The installation process creates directories (SessLogs, BadFiles, Cache, TargetFiles,
etc.) in the location where you install the PowerCenter Server. To use these directories
as the default location for the session output files, you must first set the server variable
$PMRootDir to define the path to the directories.
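As an illustration only, the default directory layout under a hypothetical root directory could be created as follows before pointing $PMRootDir at it (the installer normally creates these directories for you):
PMROOT=/opt/informatica/pmserver
mkdir -p $PMROOT/SessLogs $PMROOT/BadFiles $PMROOT/Cache $PMROOT/TargetFiles \
         $PMROOT/SourceFiles $PMROOT/ExtProc $PMROOT/Temp \
         $PMROOT/WorkflowLogs $PMROOT/LkpFiles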
By using server variables, you simplify the process of changing the PowerCenter Server
that runs a workflow. If each workflow in a folder uses server variables, then when you
copy the folder to a production repository, the PowerCenter Server in production can
run the workflow using the server variables defined in the production repository. It is
not necessary to change the workflow/session properties in production again. To ensure
a workflow completes successfully, relocate any necessary file source or incremental
aggregation file to the default directories of the new PowerCenter Server.

Causes and Analysis of UNIX Core Files
Challenge
This Best Practice explains what UNIX core files are and why they are created, and
offers some tips on analyzing them.
Description
Fatal run-time errors in UNIX programs usually result in the termination of the UNIX
process by the operating system. Usually, when the operating system terminates a
process, a core dump file is also created, which can be used to analyze the reason for
the abnormal termination.
What is a Core File and What Causes it to be Created?
UNIX operating systems may terminate a process before its normal, expected exit for
several reasons. These reasons are typically for bad behavior by the program, and
include attempts to execute illegal or incorrect machine instructions, attempts to
allocate memory outside the memory space allocated to the program, attempts to write
to memory marked read-only by the operating system and other similar incorrect low
level operations. Most of these bad behaviors are caused by errors in programming
logic in the program.
UNIX may also terminate a process for some reasons that are not caused by
programming errors. The main examples of this type of termination are when a process
exceeds its CPU time limit, and when a process exceeds its memory limit.
When UNIX terminates a process in this way, it normally writes an image of the
processes memory to disk in a single file. These files are called core files, and are
intended to be used by a programmer to help determine the cause of the failure.
Depending on the UNIX version, the name of the file will be core, or in more recent
UNIX versions, it is core.nnnn where nnnn is the UNIX process ID of the process that
was terminated.
Core files are not created for normal runtime errors such as incorrect file permissions,
lack of disk space, inability to open a file or network connection, and other errors that a
program is expected to detect and handle. However, under certain error conditions a
program may not handle the error conditions correctly and may follow a path of
execution that causes the OS to terminate it and cause a core dump.

Mixing incompatible versions of UNIX, vendor, and database libraries can often trigger
behavior that causes unexpected core dumps. For example, using an odbc driver library
from one vendor and an odbc driver manager from another vendor may result in a core
dump if the libraries are not compatible. A similar situation can occur if a process is
using libraries from different versions of a database client, such as a mixed installation
of Oracle 8i and 9i. An installation like this should not exist, but if it does, core dumps
are often the result.
Core File Locations and Size Limits
A core file is written to the current working directory of the process that was
terminated. For PowerCenter, this is always the directory the server was started from.
For other applications, this may not be true.
UNIX also implements a per-user resource limit on the maximum size of core files. This
is controlled by the ulimit command. If the limit is 0, then core files will not be created.
If the limit is less than the total memory size of the process, a partial core file will be
written. Refer to the Best Practice on UNIX resource limits.
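For example, the current limit can be checked and, for the shell that starts the PowerCenter Server, raised as follows (exact syntax may vary slightly by shell):
ulimit -c                # show the current core file size limit (0 means no core file is written)
ulimit -c unlimited      # allow full core dumps for processes started from this shell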
Analyzing Core Files
There is little information in a core file that is relevant to an end user; most of the
contents of a core file are only relevant to a developer, or someone who understands
the internals of the program that generated the core file. However, there are a few
things that an end user can do with a core file in the way of initial analysis.
The first step is to use the UNIX file command on the core, which will show which
program generated the core file:
file core.27431
core.27431: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style, from
'dd'
Core files can be generated by both the PowerCenter executables (i.e., pmserver,
pmrepserver, and pmdtm) as well as from other UNIX commands executed by the
server, typically from command tasks and pre- or post-session commands. If a
PowerCenter process is terminated by the OS and a core is generated, the session or
server log typically indicates Process terminating on Signal/Exception as its last entry.
Using the pmstack Utility
Informatica provides a pmstack utility, which can automatically analyze a core file. If
the core file is from PowerCenter, it will generate a complete stack trace from the core
file, which can be sent to Informatica Customer Support for further analysis. The trace
contains everything necessary to further diagnose the problem. Core files themselves
are normally not useful on a system other than the one where they were generated.
The pmstack utility can be downloaded from the Informatica Support knowledge base
as article 13652, and from the support ftp server at tsftp.informatica.com. Once
downloaded, run pmstack with the -c option, followed by the name of the core file:

$ pmstack -c core.21896
=================================
SSG pmstack ver 2.0 073004
=================================
Core info :
-rw------- 1 pr_pc_d pr_pc_d 58806272 Mar 29 16:28 core.21896
core.21896: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style, from
'pmdtm'

Process name used for analyzing the core : pmdtm
Generating stack trace, please wait..

Pmstack completed successfully
Please send file core.21896.trace to Informatica Technical Support
You can then look at the generated trace file or send it to support.
Pmstack also supports a -p option, which can be used to extract a stack trace from a
running process. This is sometimes useful if the process appears to be hung, to
determine what the process is doing.
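A minimal sketch of this usage, assuming the -p option takes a process ID in the same way the -c option takes a core file name (the PID shown is only an example):
ps -ef | grep pmdtm      # find the process ID of the apparently hung session process
pmstack -p 21896         # replace 21896 with the actual PID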


Determining Bottlenecks
Challenge
Because there are many variables involved in identifying and rectifying performance
bottlenecks, an efficient method for determining where bottlenecks exist is crucial to
good data warehouse management.
Description
The first step in performance tuning is to identify performance bottlenecks. Carefully
consider the following five areas to determine where bottlenecks exist; use a process of
elimination, investigating each area in the order indicated:
1. Target
2. Source
3. Mapping
4. Session
5. System
Attempt to isolate performance problems by running test sessions. You should be able
to compare the session's original performance with the tuned session's performance.
The swap method is very useful for determining the most common bottlenecks. It
involves the following five steps:
1. Make a temporary copy of the mapping, session and/or workflow that is to be
tuned, then tune the copy before making changes to the original.
2. Implement only one change at a time and test for any performance
improvements to gauge which tuning methods work most effectively in the
environment.
3. Document the change made to the mapping, session and/or workflow and the
performance metrics achieved as a result of the change. The actual execution
time may be used as a performance metric.
4. Delete the temporary mapping, session and/or workflow upon completion of
performance tuning.
5. Make appropriate tuning changes to mappings, sessions and/or workflows.
Target Bottlenecks

Relational targets
The most common performance bottleneck occurs when the PowerCenter Server writes
to a target database. This type of bottleneck can easily be identified with the following
procedure:
1. Make a copy of the original workflow
2. Configure the session in the test workflow to write to a flat file.
If session performance increases significantly when writing to a flat file, you have a
write bottleneck. Consider performing the following tasks to improve performance:
Drop indexes and key constraints.
Increase checkpoint intervals.
Use bulk loading.
Use external loading.
Increase database network packet size.
Optimize target databases.
Flat file targets
If the session targets a flat file, you probably do not have a write bottleneck. You can
optimize session performance by writing to a flat file target local to the PowerCenter
Server. If the local flat file is very large, you can optimize the write process by dividing
it among several physical drives.
Source Bottlenecks
Relational sources
If the session reads from a relational source, you can use a filter transformation, a read
test mapping, or a database query to identify source bottlenecks.
Using a Filter Transformation. Add a filter transformation in the mapping after each
source qualifier. Set the filter condition to false so that no data is processed past the
filter transformation. If the time it takes to run the new session remains about the
same, then you have a source bottleneck.
Using a Read Test Session. You can create a read test mapping to identify source
bottlenecks. A read test mapping isolates the read query by removing the
transformation in the mapping. Use the following steps to create a read test mapping:
1. Make a copy of the original mapping.
2. In the copied mapping, retain only the sources, source qualifiers, and any
custom joins or queries.
3. Remove all transformations.
4. Connect the source qualifiers to a file target.
Use the read test mapping in a test session. If the test session performance is similar to
the original session, you have a source bottleneck.

Using a Database Query. You can identify source bottlenecks by executing a read
query directly against the source database. To do so, copy the read query directly
from the session log, then run it against the source database with a query tool such
as SQL*Plus. Measure the query execution time and the time it takes for the query to
return the first row.
If there is a long delay between the two time measurements, you have a source
bottleneck.
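A minimal sketch of this test, assuming the read query has been saved to a file and using illustrative connection details; note both when the first rows appear and the total elapsed time reported by SQL*Plus:
sqlplus -s scott/tiger@sourcedb <<EOF
set timing on
@/tmp/read_query.sql
exit
EOF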
If your session reads from a relational source, review the following suggestions for
improving performance:
Optimize the query.
Create tempdb as in-memory database.
Use conditional filters.
Increase database network packet size.
Connect to Oracle databases using IPC protocol.
Flat file sources
If your session reads from a flat file source, you probably do not have a read
bottleneck. Tuning the Line Sequential Buffer Length to a size large enough to hold
approximately four to eight rows of data at a time (for flat files) may help when reading
flat file sources. Ensure the flat file source is local to the PowerCenter Server.
Mapping Bottlenecks
If you have eliminated the reading and writing of data as bottlenecks, you may have a
mapping bottleneck. Use the swap method to determine if the bottleneck is in the
mapping.
Add a Filter transformation in the mapping before each target definition. Set the filter
condition to false so that no data is loaded into the target tables. If the time it takes to
run the new session is the same as the original session, you have a mapping
bottleneck. You can also use the performance details to identify mapping bottlenecks.
High Rowsinlookupcache and High Errorrows counters indicate mapping bottlenecks.
Follow these steps to identify mapping bottlenecks:
Using a test mapping without transformations
1. Make a copy of the original mapping.
2. In the copied mapping, retain only the sources, source qualifiers, and any
custom joins or queries.
3. Remove all transformations.
4. Connect the source qualifiers to the target.
High Rowsinlookupcache counters. Multiple lookups can slow the session. You may
improve session performance by locating the largest lookup tables and tuning those
lookup expressions.

High Errorrows counters. Transformation errors affect session performance. If a
session has large numbers in any of the Transformation_errorrows counters, you may
improve performance by eliminating the errors.
For further details on eliminating mapping bottlenecks, refer to the Best Practice:
Tuning Mappings for Better Performance
Session Bottlenecks
Session performance details can be used to flag other problem areas. Create
performance details by selecting Collect Performance Data in the session properties
before running the session.
View the performance details through the Workflow Monitor as the session runs, or view
the resulting file. The performance details provide counters about each source qualifier,
target definition, and individual transformation to help you understand session and
mapping efficiency.
To watch the performance details during the session run:
Right-click the session in the Workflow Monitor
Choose Properties
Click the Properties tab in the details dialog box
To view the file, look for the file session_name.perf in the same directory as the
session log and open the file in any text editor
All transformations have basic counters that indicate the number of input rows, output
rows, and error rows. Source qualifiers, normalizers, and targets have additional
counters indicating the efficiency of data moving into and out of buffers. Some
transformations have counters specific to their functionality. When reading
performance details, the first column displays the transformation name as it appears in
the mapping, the second column contains the counter name, and the third column holds
the resulting number or efficiency percentage.
Low buffer input and buffer output counters
If the BufferInput_efficiency and BufferOutput_efficiency counters are low for all
sources and targets, increasing the session DTM buffer pool size may improve
performance.
Aggregator, rank, and joiner readfromdisk and writetodisk counters
If a session contains Aggregator, Rank, or Joiner transformations, examine
each Transformation_readfromdisk and Transformation_writetodisk counter. If these
counters display any number other than zero, you can improve session performance by
increasing the index and data cache sizes.
If the session performs incremental aggregation, the Aggregator_readfromdisk and
writetodisk counters display a number other than zero because the PowerCenter Server
reads historical aggregate data from the local disk during the session and writes to disk
when saving historical data. Evaluate the Aggregator_readfromdisk and writetodisk
counters during the session. If the counters show any numbers other than zero during
the session run, you can increase performance by tuning the index and data cache
sizes.
PowerCenter Versions 6.x and above include the ability to assign memory allocation per
object. In versions earlier than 6.x, aggregators, ranks, and joiners were assigned at a
global/session level.
For further details on eliminating session bottlenecks, refer to the Best Practice: Tuning
Sessions for Better Performance and Tuning SQL Overrides and Environment for Better
Performance.
System Bottlenecks
After tuning the source, target, mapping, and session, you may also consider tuning the
system hosting the PowerCenter Server.
The PowerCenter Server uses system resources to process transformation, session
execution, and reading and writing data. The PowerCenter Server also uses system
memory for other data such as aggregate, joiner, rank, and cached lookup tables. You
can use system performance monitoring tools to monitor the amount of system
resources the Server uses and identify system bottlenecks.
Windows NT/2000
Use system tools such as the Performance and Processes tab in the Task Manager to
view CPU usage and total memory usage. You can also view more detailed
performance information by using the Performance Monitor in the Administrative Tools
on Windows.
UNIX
On UNIX, you can use system tools to monitor system performance. Use lsattr -E -l
sys0 to view current system settings; iostat to monitor the load on every disk
attached to the database server; vmstat or sar -w to monitor disk swapping actions;
and sar -u to monitor CPU loading.
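The following invocations are illustrative only (flags shown for AIX and Solaris; check the man pages on your platform):
lsattr -E -l sys0        # current system attribute settings (AIX)
iostat 5                 # disk activity, sampled every 5 seconds
vmstat 5                 # memory, paging, and swap activity
sar -w 5 10              # swapping activity, 10 samples at 5-second intervals
sar -u 5 10              # CPU utilization, 10 samples at 5-second intervals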
For further information regarding system tuning, refer to the Best Practices:
Performance Tuning UNIX Systems and Performance Tuning Windows NT/2000
Systems.


PAGE BP-354 BEST PRACTICES INFORMATICA CONFIDENTIAL









Managing Repository Size
Challenge
The PowerCenter repository grows over time as new development and production runs
occur. Eventually, the repository can reach a size that may start slowing repository
performance or make backups increasingly difficult.
This Best Practice discusses methods to manage the size of the repository.
The release of PowerCenter version 7.x added several features that aid in managing the
repository size. Although the repository is slightly larger with version 7.x than it was
with the previous versions, the client tools include additional functionality that limits
the impact of repository size. PowerCenter versions earlier than 7.x require more
administration to keep repository sizes manageable.
Description
Why should we manage the size of the repository?
The repository size affects the following:
DB backups and restores. If database backups are being performed, the size
required for the backup can be reduced. If PowerCenter backups are being
used, you can limit what gets backed up.
Overall query time of the repository, which slows performance of the
repository over time. Analyzing tables on a regular basis can aid in your
repository table performance.
Migrations (i.e., copying from one repository to the next). Limit data transfer
between repositories to avoid locking up the repository for a lengthy period of
time. Some options are available to avoid transferring all run statistics when
migrating.
A typical repository starts off small (i.e., 50-60MB for an empty repository) and
grows over time, to upwards of 1GB for a large repository. The type of information
stored in the repository includes:
o Versions
o Objects
o Run statistics
o Scheduling information
o Variables
Tips for Managing the Size of the Repository

Versions and Objects
Delete old versions or purged objects from the repository. Use repository queries in
the client tools to build reusable queries that identify out-of-date versions and
objects for removal.
Old versions and objects not only increase the size of the repository, but also make it
more difficult to manage further into the development cycle. Cleaning up the folders
makes it easier to determine what is valid and what is not.
Folders
Remove folders and objects that are no longer used or referenced. Unnecessary folders
increase the size of the repository backups. These folders should not be a part of
production but they may be found in development or test repositories.
Run Statistics
Remove old run statistics from the repository if you no longer need them. History is
important to determine trending, scaling, and performance tuning needs but you can
always generate reports based on the PowerCenter Metadata Reporter and save the
reports of the data you need. To remove the run statistics, go to the Repository
Manager and truncate the logs based on the dates.
Recommendations
Informatica strongly recommends upgrading to the latest version of PowerCenter since
the latest release includes such features as backup without run statistics, copying only
objects with no history, repository queries in the client tools, and so forth. The
repository size in version 7.x and above is larger than the previous versions of
PowerCenter but the added size does not dramatically affect the performance of the
repository. It is still advisable to analyze the repository tables (i.e., gather statistics)
regularly to keep them optimized.
Informatica recommends against direct access to the repository tables or performing
deletes on them. Use the client tools unless otherwise advised by Informatica.


Organizing and Maintaining Parameter Files & Variables
Challenge
Organizing variables and parameters in Parameter files and maintaining Parameter files
for ease of use.
Description
Parameter files are a means of providing run-time values for parameters and variables
defined in a workflow, worklet, session, mapplet, or mapping. A parameter file can
contain values for more than one workflow, session, and mapping, and can be created
using text editors such as Notepad or vi.
Variable values are stored in the repository and can be changed within mappings and
workflows. However, variable values specified in parameter files supersede values
stored in the repository. The values stored in the repository can be cleared or reset
using the Workflow Manager.
Parameter File Contents
A parameter file contains the values for variables and parameters. Although a
parameter file can contain values for more than one workflow (or session), for ease of
administration it is advisable to build a parameter file that contains values for a single
workflow or a logical group of workflows. When using command-line mode to execute
workflows, multiple parameter files can also be configured and used for a single
workflow if the same workflow needs to be run with different parameters.
Parameter File Name
Name the Parameter File the same as the workflow name with a suffix of .par. This
helps in identifying and linking the parameter file to a workflow.
Parameter File Order Of Precedence
While it is possible to assign Parameter Files to a session and a workflow, it is important
to note that a file specified at the workflow level will always supersede files specified at
session levels.
Parameter File Location

Place the parameter files in a directory that can be accessed using a server variable.
This makes it possible to move the sessions and workflows to a different server without
modifying workflow or session properties. You can override the location and name of
the parameter file specified in the session or workflow when executing workflows via
the pmcmd command.
The following points apply to both parameter and variable files; however, they are
more relevant to parameters and parameter files, and are therefore detailed
accordingly.
Multiple Parameter Files for a workflow
To run a workflow with different sets of parameter values during every run:
a. Create multiple parameter files with unique names.
b. Change the parameter file name (to match the parameter file name defined in
the session or workflow properties). This can be done manually or by using a
pre-session shell (or batch) script.
c. Run the workflow.
Alternatively, run the workflow using pmcmd with the -paramfile option in place of
steps b and c.
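As a sketch of the pmcmd approach, the same workflow could be run once for each parameter file in a directory; the server address, credentials, folder, and workflow names below are placeholders, and the exact connection flags depend on the PowerCenter version:
for pf in /app/params/wf_monthly/*.par
do
    pmcmd startworkflow -s pmserver_host:4001 -u Administrator -p mypassword \
          -f PROJ_DP -paramfile "$pf" -wait wf_Monthly_Load
done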
Generating Parameter files
Based on requirements, you can obtain the values for certain parameters from
relational tables or generate them programmatically. In such cases, the parameter files
can be generated dynamically using shell (or batch) scripts or using Informatica
mappings and sessions.
Consider a case where a session has to be executed only on specific dates (e.g., the
last working day of every month), which are listed in a table. You can create the
parameter file containing the next run date (extracted from the table) in more than one
way.
Method 1:
1. The workflow is configured to use a parameter file.
2. The workflow has a decision task before running the session, comparing the
current system date against the date in the parameter file. See Figure 1.
3. Use a shell (or batch) script to create the parameter file. Use an SQL query to
extract a single date that is greater than the system date (today) from the
table, and write it to a file in the required format.
4. The shell script uses pmcmd to run the workflow.
5. The shell script is scheduled using cron or an external scheduler to run daily.
See Figure 2.

Figure 1. Shell script to generate parameter file

Figure 2. Generated parameter file
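Because the referenced figures cannot be reproduced here, the following is a minimal sketch of such a script. The control table (run_dates), connection details, parameter name ($$NextRunDate), and workflow name are all hypothetical:
#!/bin/sh
# Extract the next run date (first date after today) from the control table.
NEXT_DATE=$(sqlplus -s ctrl_user/ctrl_pwd@ctrl_db <<EOF
set heading off feedback off pagesize 0
select to_char(min(run_date), 'MM/DD/YYYY') from run_dates where run_date > sysdate;
exit
EOF
)

# Write the parameter file in the format the workflow expects.
PARAMFILE=/app/params/wf_Monthly_Load.par
echo "[PROJ_DP.WF:wf_Monthly_Load]" >  $PARAMFILE
echo "\$\$NextRunDate=$NEXT_DATE"   >> $PARAMFILE

# Start the workflow; its decision task compares the system date with the date parameter.
pmcmd startworkflow -s pmserver_host:4001 -u Administrator -p mypassword \
      -f PROJ_DP -paramfile $PARAMFILE wf_Monthly_Load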
Method 2:
1. The workflow is configured to use a parameter file.
2. The initial value for the date parameter is the first date on which the workflow is
to run.
3. The workflow has a decision task before running the session, comparing the
current system date against the date in the parameter file.
4. The last task in the workflow generates the parameter file for the next run of the
workflow, using either a command task that calls a shell script or a session task
that uses a mapping. This task extracts a date that is greater than the system date
(today) from the table and writes it into the parameter file in the required format.
5. Schedule the workflow using the Informatica Scheduler to run daily.

Figure 3. Workflow and parameter definition
Parameter file templates
In some other cases, the parameter values change between runs, but the change can
be incorporated into the parameter files programmatically. There is no need to maintain
separate parameter files for each run.
Consider, for example, a service provider who gets the source data for each client from
flat files located in client-specific directories and writes processed data into a global
database. The source data structure, target data structure, and processing logic are all
the same. The log file for each client run has to be preserved in a client-specific
directory. The directory names have the client ID as part of the directory structure
(e.g., /app/data/Client_ID/).
You can complete the work for all clients using a set of mappings, sessions, and a
workflow, with one parameter file per client. However, the number of parameter files
may become cumbersome to manage when the number of clients increases.

In such cases, a parameter file template (i.e., a parameter file containing values for
some parameters and placeholders for others) may prove useful. Use a shell (or batch)
script at run time to create the actual parameter file (for a specific client), replacing the
placeholders with actual values, and then execute the workflow using pmcmd. See
Figure 4.
[PROJ_DP.WF:Client_Data]
$InputFile_1=/app/data/Client_ID/input/client_info.dat
$LogFile=/app/data/Client_ID/logfile/wfl_client_data_curdate.log

Figure 4. Parameter file template
Using a script, replace Client_ID and curdate with actual values before executing the
workflow.
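A minimal sketch of such a script, with hypothetical paths, names, and connection flags (the template file corresponds to Figure 4):
#!/bin/sh
CLIENT_ID=$1
CURDATE=`date +%Y%m%d`
TEMPLATE=/app/params/wf_Client_Data.template
PARAMFILE=/app/params/wf_Client_Data_${CLIENT_ID}.par

# Substitute the placeholders in the template with run-time values.
sed -e "s/Client_ID/${CLIENT_ID}/g" -e "s/curdate/${CURDATE}/g" $TEMPLATE > $PARAMFILE

# Run the workflow with the generated parameter file.
pmcmd startworkflow -s pmserver_host:4001 -u Administrator -p mypassword \
      -f PROJ_DP -paramfile $PARAMFILE wf_Client_Data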


Performance Tuning Databases (Oracle)
Challenge
Database tuning can result in tremendous improvement in loading performance. This
Best Practice covers tips on tuning Oracle.
Description

Performance Tuning Tools
Oracle offers many tools for tuning an Oracle instance. Most DBAs are already familiar
with these tools, so we've included only a short description of some of the major ones
here.
V$ Views
V$ views are dynamic performance views that provide real-time information on
database activity, enabling the DBA to draw conclusions about database performance.
Because SYS is the owner of these views, only SYS can query them by default. Keep in
mind that querying these views impacts database performance, with each query having
an immediate hit. With this in mind, carefully consider which users should be granted
the privilege to query these views. You can grant viewing privileges with either the
SELECT privilege, which allows a user to view individual V$ views, or the SELECT ANY
TABLE privilege, which allows the user to view all V$ views. Using the SELECT ANY
TABLE option requires the O7_DICTIONARY_ACCESSIBILITY parameter to be set to
TRUE, which allows the ANY keyword to apply to SYS-owned objects.
Explain Plan
Explain Plan, SQL Trace, and TKPROF are powerful tools for revealing bottlenecks and
developing a strategy to avoid them.
Explain Plan allows the DBA or developer to determine the execution path of a block of
SQL code. The SQL in a source qualifier or in a lookup that is running for a long time
should be generated and copied to SQL*PLUS or other SQL tool and tested to avoid
inefficient execution of these statements. Review the PowerCenter session log for long
initialization time (an indicator that the source qualifier may need tuning) and the time

PAGE BP-362 BEST PRACTICES INFORMATICA CONFIDENTIAL
it takes to build a lookup cache to determine if the SQL for these transformations
should be tested.
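As an illustration, the following SQL*Plus sketch explains a statement and displays its execution path by querying PLAN_TABLE; the example table and columns are illustrative, and the utlxplan script location can vary by release.

-- Create PLAN_TABLE once per schema if it does not already exist
@?/rdbms/admin/utlxplan

-- Explain the source qualifier or lookup SQL copied from the session
EXPLAIN PLAN SET STATEMENT_ID = 'SQ_EMP' FOR
SELECT empno, ename FROM emp WHERE active_flag = 'Y';

-- Display the execution path
SELECT LPAD(' ', 2*LEVEL) || operation || ' ' || options || ' ' || object_name AS step
FROM plan_table
START WITH id = 0 AND statement_id = 'SQ_EMP'
CONNECT BY PRIOR id = parent_id AND statement_id = 'SQ_EMP';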
SQL Trace
SQL Trace extends the functionality of Explain Plan by providing statistical information
about the SQL statements executed in a session that has tracing enabled. This utility is
run for a session with the ALTER SESSION SET SQL_TRACE = TRUE statement.
TKPROF
The output of SQL Trace is provided in a dump file that is difficult to read. TKPROF
formats this dump file into a more understandable report.
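A hedged sketch of the trace-and-format cycle follows; the connect string reuses the pmuser/infadb example that appears later in this Best Practice, and the trace file name and user_dump_dest path are assumptions you must look up on your own instance.

# Enable tracing, run the statement under investigation, then disable tracing
sqlplus -s pmuser/pmuser@infadb <<EOF
ALTER SESSION SET SQL_TRACE = TRUE;
SELECT empno, ename FROM emp WHERE active_flag = 'Y';
ALTER SESSION SET SQL_TRACE = FALSE;
EXIT;
EOF

# Format the raw trace file written to user_dump_dest (file name will differ)
tkprof /oracle/admin/infadb/udump/infadb_ora_12345.trc /tmp/trace_report.txt \
  explain=pmuser/pmuser sys=no sort=fchela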
UTLBSTAT & UTLESTAT
Executing UTLBSTAT creates tables to store dynamic performance statistics and begins
the statistics collection process. Run this utility after the database has been up and
running (for hours or days). Accumulating statistics may take time, so you need to run
this utility for a long while and through several operations (i.e., both loading and
querying).
UTLESTAT ends the statistics collection process and generates an output file called
report.txt. This report should give the DBA a fairly complete idea about the level of
usage the database experiences and reveal areas that should be addressed.
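The scripts live under $ORACLE_HOME/rdbms/admin and are run from SQL*Plus (or Server Manager) while connected as a DBA account; a minimal sketch:

-- Start statistics collection at the beginning of a representative window
@?/rdbms/admin/utlbstat

-- ... let the database run through loading and querying activity ...

-- End collection; report.txt is written to the current directory
@?/rdbms/admin/utlestat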
Disk I/O
Disk I/O at the database level provides the highest level of performance gain in most
systems. Database files should be separated and identified. Rollback files should be
separated onto their own disks because they have significant disk I/O. Co-locate tables
that are heavily used with tables that are rarely used to help minimize disk contention.
Separate indexes from their tables so that queries reading both are not fighting for
the same resource. Also be sure to implement disk striping; this, or RAID technology,
can help immensely in reducing disk contention. While this type of planning is time
consuming, the payoff is well worth the effort in terms of performance gains.
Memory and Processing
Memory and processing configuration is done in the init.ora file. Because each database
is different and requires an experienced DBA to analyze and tune it for optimal
performance, a standard set of parameters to optimize PowerCenter is not practical and
will probably never exist.
TIP
Changes made in the init.ora file take effect only after a restart of the instance.
Use svrmgr to issue the shutdown and startup commands (or shutdown
immediate) to the instance. Note that svrmgr is no longer available as
of Oracle 9i because Oracle is moving to a web-based Server Manager in
Oracle 10g. If you are on Oracle 9i, either install the Oracle client tools and
log onto Oracle Enterprise Manager, or use a tool such as DBArtisan, which also
exposes the initialization parameters.

The settings presented here are those used on a 4-CPU AIX server running Oracle 7.3.4,
set to make use of the parallel query option to facilitate parallel processing of queries
and indexes. We've also included the descriptions and documentation from Oracle for
each setting to help DBAs of other (non-Oracle) systems to determine what the
commands do in the Oracle environment to facilitate setting their native database
commands and settings in a similar fashion.
HASH_AREA_SIZE = 16777216

Default value: 2 times the value of SORT_AREA_SIZE
Range of values: any integer
This parameter specifies the maximum amount of memory, in bytes, to be used for
the hash join. If this parameter is not set, its value defaults to twice the value of
the SORT_AREA_SIZE parameter.
The value of this parameter can be changed without shutting down the Oracle
instance by using the ALTER SESSION command. (Note: ALTER SESSION refers
to the Database Administration command issued at the svrmgr command
prompt).
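For example, to raise the hash area for the current session only, matching the init.ora value above:

ALTER SESSION SET HASH_AREA_SIZE = 16777216;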
HASH_JOIN_ENABLED
o In Oracle 7 and Oracle 8 the hash_join_enabled parameter must be set to
true.
o In Oracle 8i and above hash_join_enabled=true is the default value
HASH_MULTIBLOCK_IO_COUNT
o Allows multiblock reads against the TEMP tablespace
o It is advisable to set the NEXT extentsize to greater than the value for
hash_multiblock_io_count to reduce disk I/O
o This is the same behavior seen when setting the
db_file_multiblock_read_count parameter for data tablespaces except
this one applies only to multiblock access of segments of TEMP
Tablespace
STAR_TRANSFORMATION_ENABLED
o Determines whether a cost based query transformation will be applied to
star queries
o When set to TRUE, the optimizer will consider performing a cost-based query
transformation on the n-way join table
OPTIMIZER_INDEX_COST_ADJ
o Numeric parameter with a range of 1 to 10000 (default 100)
o This parameter lets you tune the optimizer behavior for access path
selection to be more or less index friendly
Optimizer_percent_parallel=33
This parameter defines the amount of parallelism that the optimizer uses in its cost
functions. The default of 0 means that the optimizer chooses the best serial plan. A
value of 100 means that the optimizer uses each object's degree of parallelism in
computing the cost of a full table scan operation.

The value of this parameter can be changed without shutting down the Oracle instance
by using the ALTER SESSION command. Low values favor indexes, while high values
favor table scans.
Cost-based optimization is always used for queries that reference an object with a
nonzero degree of parallelism. For such queries, a RULE hint or optimizer mode or goal
is ignored. Use of a FIRST_ROWS hint or optimizer mode overrides a nonzero setting of
OPTIMIZER_PERCENT_PARALLEL.
parallel_max_servers=40

Used to enable parallel query.
Initially not set on Install.
Maximum number of query servers or parallel recovery processes for an instance.
Parallel_min_servers=8

o Used to enable parallel query.
o Initially not set on Install.
o Minimum number of query server processes for an instance. This is also the
number of query server processes Oracle creates when the instance is
started.
SORT_AREA_SIZE=8388608

Default value: Operating system-dependent
Minimum value: the value equivalent to two database blocks
This parameter specifies the maximum amount, in bytes, of Program Global Area
(PGA) memory to use for a sort. After the sort is complete and all that remains
to do is to fetch the rows out, the memory is released down to the size specified
by SORT_AREA_RETAINED_SIZE. After the last row is fetched out, all memory is
freed. The memory is released back to the PGA, not to the operating system.
Increasing SORT_AREA_SIZE size improves the efficiency of large sorts. Multiple
allocations never exist; there is only one memory area of SORT_AREA_SIZE for
each user process at any time.
The default is usually adequate for most database operations. However, if very
large indexes are created, this parameter may need to be adjusted. For
example, if one process is doing all database access, as in a full database
import, then an increased value for this parameter may speed the import,
particularly the CREATE INDEX statements.
IPC as an Alternative to TCP/IP on UNIX
On an HP/UX server with Oracle as a target (i.e., PMServer and Oracle target on same
box), using an IPC connection can significantly reduce the time it takes to build a
lookup cache. In one case, a fact mapping that was using a lookup to get five columns
(including a foreign key) and about 500,000 rows from a table was taking 19 minutes.
Changing the connection type to IPC reduced this to 45 seconds. In another mapping,
the total time decreased from 24 minutes to 8 minutes for ~120-130 bytes/row,
500,000 row write (array inserts), primary key with unique index in place. Performance
went from about 2Mb/min (280 rows/sec) to about 10Mb/min (1360 rows/sec).
A normal tcp (network tcp/ip) connection in tnsnames.ora would look like this:
DW.armafix =
(DESCRIPTION =
(ADDRESS_LIST =
(ADDRESS =
(PROTOCOL =TCP)
(HOST = armafix)
(PORT = 1526)
)
)
(CONNECT_DATA=(SID=DW)
)
)
Make a new entry in the tnsnames like this, and use it for connection to the local Oracle
instance:
DWIPC.armafix =
(DESCRIPTION =
(ADDRESS =
(PROTOCOL=ipc)
(KEY=DW)
)
(CONNECT_DATA=(SID=DW))
)
Improving Data Load Performance
Alternative to Dropping and Reloading Indexes
Dropping and reloading indexes during very large loads to a data warehouse is often
recommended but there is seldom any easy way to do this. For example, writing a SQL
statement to drop each index, then writing another SQL statement to rebuild it can be a
very tedious process.
Oracle 7 (and above) offers an alternative to dropping and rebuilding indexes by
allowing you to disable and re-enable existing indexes. Oracle stores the name of each
index in a table that can be queried. With this in mind, it is an easy matter to write a
SQL statement that queries this table and then generates SQL statements as output to
disable and enable these indexes.
Run the following to generate output to disable the foreign keys in the data warehouse:
SELECT 'ALTER TABLE ' || OWNER || '.' || TABLE_NAME || ' DISABLE CONSTRAINT ' ||
CONSTRAINT_NAME || ' ;'
FROM USER_CONSTRAINTS
WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')
AND CONSTRAINT_TYPE = 'R'
This produces output that looks like:
ALTER TABLE MDDB_DEV.AGREEMENT_DIM DISABLE CONSTRAINT SYS_C0011077
;
ALTER TABLE MDDB_DEV.AGREEMENT_DIM DISABLE CONSTRAINT SYS_C0011075
;
ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE CONSTRAINT SYS_C0011060
;
ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE CONSTRAINT SYS_C0011059
;
ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE CONSTRAINT
SYS_C0011133 ;
ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE CONSTRAINT
SYS_C0011134 ;
ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE CONSTRAINT
SYS_C0011131 ;
Dropping or disabling primary keys will also speed loads. Run the results of this SQL
statement after disabling the foreign key constraints:
SELECT 'ALTER TABLE ' || OWNER || '.' || TABLE_NAME || ' DISABLE PRIMARY KEY ;'
FROM USER_CONSTRAINTS
WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')
AND CONSTRAINT_TYPE = 'P'
This produces output that looks like:
ALTER TABLE MDDB_DEV.AGREEMENT_DIM DISABLE PRIMARY KEY ;
ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE PRIMARY KEY ;
ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE PRIMARY KEY ;
Finally, disable any unique constraints with the following:
SELECT 'ALTER TABLE ' || OWNER || '.' || TABLE_NAME || ' DISABLE CONSTRAINT ' ||
CONSTRAINT_NAME || ' ;'
FROM USER_CONSTRAINTS
WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')
AND CONSTRAINT_TYPE = 'U'
This produces output that looks like:
ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE CONSTRAINT SYS_C0011070 ;
ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE CONSTRAINT
SYS_C0011071 ;
Save the results in a single file and name it something like DISABLE.SQL; run it as a
pre-session command.
To re-enable the constraints, rerun these queries after replacing DISABLE with ENABLE.
Save the results in another file with a name such as ENABLE.SQL and run it as a post-
session command.
Re-enable constraints in the reverse order that you disabled them: re-enable the
unique constraints first, and re-enable primary keys before foreign keys.
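The generated statements can be captured directly into these files with SQL*Plus SPOOL. The sketch below builds DISABLE.SQL while connected as the warehouse schema owner; the formatting settings may need adjustment for your environment, and ENABLE.SQL is built the same way with DISABLE replaced by ENABLE and the statement order reversed.

SET HEADING OFF
SET FEEDBACK OFF
SET PAGESIZE 0
SET LINESIZE 200
SPOOL DISABLE.SQL
SELECT 'ALTER TABLE ' || OWNER || '.' || TABLE_NAME || ' DISABLE CONSTRAINT ' ||
       CONSTRAINT_NAME || ' ;'
FROM USER_CONSTRAINTS
WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')
AND CONSTRAINT_TYPE = 'R';
SELECT 'ALTER TABLE ' || OWNER || '.' || TABLE_NAME || ' DISABLE PRIMARY KEY ;'
FROM USER_CONSTRAINTS
WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')
AND CONSTRAINT_TYPE = 'P';
SELECT 'ALTER TABLE ' || OWNER || '.' || TABLE_NAME || ' DISABLE CONSTRAINT ' ||
       CONSTRAINT_NAME || ' ;'
FROM USER_CONSTRAINTS
WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')
AND CONSTRAINT_TYPE = 'U';
SPOOL OFF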
TIP
Dropping or disabling foreign keys will often boost loading, but this also slows
queries (such as lookups) and updates. If you do not use lookups or updates
on your target tables you should get a boost by using this SQL statement to
generate scripts. If you use lookups and updates (especially on large tables),
you can exclude the index that will be used for the lookup from your script.
You may want to experiment to determine which method is faster.

Optimizing Query Performance
Oracle Bitmap Indexing
With version 7.3.x, Oracle added bitmap indexing to supplement the traditional b-tree
index. A b-tree index can greatly improve query performance on data that has high
cardinality or contains mostly unique values, but is not much help for low
cardinality/highly duplicated data and may even increase query time. A typical example
of a low cardinality field is gender: it is either male or female (or possibly unknown).
This kind of data is an excellent candidate for a bitmap index, and can significantly
improve query performance.
Keep in mind, however, that b-tree indexing is still the Oracle default. If you don't
specify an index type when creating an index, Oracle will default to b-tree. Also note
that for certain columns, bitmaps will be smaller and faster to create than a b-tree
index on the same column.
Bitmap indexes are suited to data warehousing because of their performance, their
compact size, and the speed with which they can be created and dropped. Since most
dimension tables in a warehouse have nearly every column indexed, the space savings
are dramatic. But it is important to
note that when a bitmap-indexed column is updated, every row associated with that
bitmap entry is locked, making bitmap indexing a poor choice for OLTP database tables
with constant insert and update traffic. Also, bitmap indexes are rebuilt after each DML
statement (e.g., inserts and updates), which can make loads very slow. For this reason,
it is a good idea to drop or disable bitmap indexes prior to the load and re-create or re-
enable them after the load.
The relationship between Fact and Dimension keys is another example of low
cardinality. With a b-tree index on the Fact table, a query is processed by joining all the
Dimension tables in a Cartesian product based on the WHERE clause, then joining back to
the Fact table. With a bitmapped index on the Fact table, a star query may be created
that accesses the Fact table first followed by the Dimension table joins, avoiding a
Cartesian product of all possible Dimension attributes. This star query access method
is only used if the STAR_TRANSFORMATION_ENABLED parameter is equal to TRUE in
the init.ora file and if there are single column bitmapped indexes on the fact table
foreign keys. Creating bitmap indexes is similar to creating b-tree indexes. To specify a
bitmap index, add the word bitmap between create and index. All other syntax is
identical.
Bitmap indexes
drop index emp_active_bit;
drop index emp_gender_bit;
create bitmap index emp_active_bit on emp (active_flag);
create bitmap index emp_gender_bit on emp (gender);
B-tree indexes
drop index emp_active;
drop index emp_gender;
create index emp_active on emp (active_flag);
create index emp_gender on emp (gender);
Information for bitmap indexes is stored in the data dictionary in dba_indexes,
all_indexes, and user_indexes with the word BITMAP in the Uniqueness column rather
than the word UNIQUE. Bitmap indexes cannot be unique.
To enable bitmap indexes, you must set the following items in the instance initialization
file:
compatible = 7.3.2.0.0 # or higher
event = "10111 trace name context forever"
event = "10112 trace name context forever"
event = "10114 trace name context forever"
Also note that the parallel query option must be installed in order to create bitmap
indexes. If you try to create bitmap indexes without the parallel query option, a syntax
error will appear in your SQL statement; the keyword bitmap won't be recognized.
TIP
To check if the parallel query option is installed, start and log into SQL*Plus.
If the parallel query option is installed, the word parallel appears in the
banner text.

Index Statistics
Table method
Index statistics are used by Oracle to determine the best method to access tables and
should be updated periodically as normal DBA procedures. The following will improve
query results on Fact and Dimension tables (including appending and updating records)
by updating the table and index statistics for the data warehouse:
The following SQL statement can be used to analyze the tables in the database:
SELECT 'ANALYZE TABLE ' || TABLE_NAME || ' COMPUTE STATISTICS;'
FROM USER_TABLES
WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')
This generates the following results:
ANALYZE TABLE CUSTOMER_DIM COMPUTE STATISTICS;
ANALYZE TABLE MARKET_DIM COMPUTE STATISTICS;
ANALYZE TABLE VENDOR_DIM COMPUTE STATISTICS;
The following SQL statement can be used to analyze the indexes in the database:
SELECT 'ANALYZE INDEX ' || INDEX_NAME || ' COMPUTE STATISTICS;'
FROM USER_INDEXES
WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')
This generates the following results:
ANALYZE INDEX SYS_C0011125 COMPUTE STATISTICS;
ANALYZE INDEX SYS_C0011119 COMPUTE STATISTICS;
ANALYZE INDEX SYS_C0011105 COMPUTE STATISTICS;
Save these results as a SQL script to be executed before or after a load.
Schema method
Another way to update index statistics is to compute indexes by schema rather than by
table. If data warehouse indexes are the only indexes located in a single schema, then
you can use the following command to update the statistics:
EXECUTE SYS.DBMS_UTILITY.Analyze_Schema ('BDB', 'compute');
In this example, BDB is the schema for which the statistics should be updated. Note
that the DBA must grant the execution privilege for dbms_utility to the database user
executing this command.
TIP
These SQL statements can be very resource intensive, especially for very
large tables. For this reason, we recommend running them at off-peak times
when no other process is using the database. If you find the exact
computation of the statistics consumes too much time, it is often acceptable to
estimate the statistics rather than compute them. Use estimate instead of
compute in the above examples.
Parallelism
Parallel execution can be implemented at the SQL statement, database object, or
instance level for many SQL operations. The degree of parallelism should be identified
based on the number of processors and disk drives on the server, with the number of
processors being the minimum degree.
SQL Level Parallelism
Hints are used to define parallelism at the SQL statement level. The following examples
demonstrate how to utilize four processors:
SELECT /*+ PARALLEL(order_fact,4) */ ;
SELECT /*+ PARALLEL_INDEX(order_fact, order_fact_ixl,4) */ ;
TIP
When using a table alias in the SQL Statement, be sure to use this alias in the
hint. Otherwise, the hint will not be used, and you will not receive an error
message.
Example of improper use of alias:
SELECT /*+PARALLEL (EMP, 4) */ EMPNO, ENAME
FROM EMP A
Here, the parallel hint will not be used because the alias A is defined for table EMP while
the hint references EMP. The correct way is:
SELECT /*+PARALLEL (A, 4) */ EMPNO, ENAME
FROM EMP A
Table Level Parallelism
Parallelism can also be defined at the table and index level. The following example
demonstrates how to set a table's degree of parallelism to four for all eligible SQL
statements on this table:
ALTER TABLE order_fact PARALLEL 4;
Ensure that Oracle is not contending with other processes for these resources or you
may end up with degraded performance due to resource contention.
Additional Tips
Executing Oracle SQL scripts as pre and post session commands on UNIX
You can execute queries as both pre- and post-session commands. For a UNIX
environment, the format of the command is:
sqlplus -s user_id/password@database @script_name.sql
For example, to execute the ENABLE.SQL file created earlier (assuming the data
warehouse is on a database named infadb), you would execute the following as a
post-session command:
sqlplus -s pmuser/pmuser@infadb @ /informatica/powercenter/Scripts/ENABLE.SQL
In some environments, this may be a security issue since both username and password
are hard-coded and unencrypted. To avoid this, use the operating system's
authentication to log onto the database instance.
In the following example, the Informatica id pmuser is used to log onto the Oracle
database. Create the Oracle user pmuser with the following SQL statement:
CREATE USER PMUSER IDENTIFIED EXTERNALLY
DEFAULT TABLESPACE . . .
TEMPORARY TABLESPACE . . .
In the following pre-session command, pmuser (the id Informatica is logged onto the
operating system as) is automatically passed from the operating system to the
database and used to execute the script:
sqlplus -s /@infadb @/informatica/powercenter/Scripts/ENABLE.SQL
You may want to use the init.ora parameter os_authent_prefix to distinguish between
normal Oracle users and externally identified ones.
DRIVING_SITE Hint
If the source and target are on separate instances, the Source Qualifier transformation
should be executed on the target instance.
For example, you want to join two source tables (A and B) together, which may reduce
the number of selected rows. However, Oracle fetches all of the data from both tables,
moves the data across the network to the target instance, then processes everything
on the target instance. If either data source is large, this causes a great deal of network
traffic. To force the Oracle optimizer to process the join on the source instance, use the
Generate SQL option in the source qualifier and include the driving_site hint in the
SQL statement as:
SELECT /*+ DRIVING_SITE */ ;
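A more complete, hedged example follows. It assumes the two source tables are reached over a database link from the target instance; the table, column, and link names are illustrative only. The hint names the alias whose site should drive the query, so the join is processed on the source instance before the reduced result set crosses the network.

SELECT /*+ DRIVING_SITE(a) */ a.order_id, a.order_amt, b.customer_name
FROM orders@src_link a, customers@src_link b
WHERE a.customer_id = b.customer_id;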


Performance Tuning Databases (SQL Server)
Challenge
Database tuning can result in tremendous improvement in loading performance. This
Best Practice covers tips on tuning SQL Server.
Description
Proper tuning of the source and target database is a very important consideration to
the scalability and usability of a business analytical environment. Managing
performance on an SQL Server encompasses the following points.
Manage system memory usage (RAM caching)
Create and maintain good indexes
Partition large data sets and indexes
Monitor disk I/O subsystem performance
Tune applications and queries
Optimize active data
Manage RAM Caching
Managing random access memory (RAM) buffer cache is a major consideration in any
database server environment. Accessing data in RAM cache is much faster than
accessing the same Information from disk. If database I/O (input/output operations to
the physical disk subsystem) can be reduced to the minimal required set of data and
index pages, these pages will stay in RAM longer. Too much unneeded data and index
information flowing into buffer cache quickly pushes out valuable pages. The primary
goal of performance tuning is to reduce I/O so that buffer cache is best utilized.
Several settings in SQL Server can be adjusted to take advantage of SQL Server RAM
usage:
Max async I/O is used to specify the number of simultaneous disk I/O operations
that SQL Server can submit to the operating system. Note that this setting
is automated in SQL Server 2000.
SQL Server allows several selectable models for database recovery (see the example
below); these include:
o Full Recovery
o Bulk-Logged Recovery
o Simple Recovery
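For example, a load process might switch the database to Bulk-Logged recovery before a large load and back to Full afterwards; the ALTER DATABASE ... SET RECOVERY syntax shown applies to SQL Server 2000 and later, and the database name is illustrative.

ALTER DATABASE SalesDW SET RECOVERY BULK_LOGGED
GO
-- ... run the bulk load ...
ALTER DATABASE SalesDW SET RECOVERY FULL
GO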

Create and maintain good indexes
A key factor in maintaining minimum I/O for all database queries is ensuring that good
indexes are created and maintained.
Partition large data sets and indexes
To reduce overall I/O contention and improve parallel operations, consider partitioning
table data and indexes. Multiple techniques for achieving and managing partitions using
SQL Server 2000 are addressed in this chapter.
Tune applications and queries
This becomes especially important when a database server will be servicing requests
from hundreds or thousands of connections through a given application. Because
applications typically determine the SQL queries that will be executed on a database
server, it is very important for application developers to understand SQL Server
architectural basics and how to take full advantage of SQL Server indexes to minimize
I/O.
Partitioning for Performance
The simplest technique for creating disk I/O parallelism is to use hardware partitioning
and create a single "pool of drives" that serves all SQL Server database files except
transaction log files, which should always be stored on physically separate disk drives
dedicated to log files only. See Microsoft Documentation for installation procedures.
Objects For Partitioning Consideration
The following areas of SQL Server activity can be separated across different hard
drives, RAID controllers, and PCI channels (or combinations of the three):
Transaction logs
Tempdb
Database
Tables
Nonclustered Indexes
Note: In SQL Server 2000, Microsoft introduced enhancements to distributed partitioned
views that enable the creation of federated databases (commonly referred to as scale-
out), which spread resource load and I/O activity across multiple servers. Federated
databases are appropriate for some high-end online transaction processing (OLTP)
applications, but this approach is not recommended for addressing the needs of a data
warehouse.
Segregating the Transaction Log
Transaction log files should be maintained on a storage device physically separate from
devices that contain data files. Depending on your database recovery model setting,
most update activity generates both data device activity and log activity. If both are set
up to share the same device, the operations to be performed will compete for the same
limited resources. Most installations benefit from separating these competing I/O
activities.
Segregating tempdb
SQL Server creates a database, tempdb, on every server instance to be used by the
server as a shared working area for various activities, including temporary tables,
sorting, processing subqueries, building aggregates to support GROUP BY or ORDER BY
clauses, queries using DISTINCT (temporary worktables have to be created to remove
duplicate rows), cursors, and hash joins.
To move the tempdb database, use the ALTER DATABASE command to change the
physical file location of the SQL Server logical file name associated with tempdb. For
example, to move tempdb and its associated log to the new file locations E:\mssql7
and C:\temp, use the following commands:
ALTER DATABASE tempdb MODIFY FILE (NAME = 'tempdev', FILENAME =
'e:\mssql7\tempnew_location.mdf')
ALTER DATABASE tempdb MODIFY FILE (NAME = 'templog', FILENAME =
'c:\temp\tempnew_loglocation.mdf')
The master database, msdb, and model databases are not used much during
production compared to user databases, so it is typically not necessary to consider
them in I/O performance tuning considerations. The master database is usually used
only for adding new logins, databases, devices, and other system objects.
Database Partitioning
Databases can be partitioned using files and/or filegroups. A filegroup is simply a
named collection of individual files grouped together for administration purposes. A file
cannot be a member of more than one filegroup. Tables, indexes, text, ntext, and
image data can all be associated with a specific filegroup. This means that all their
pages are allocated from the files in that filegroup. The three types of filegroups are
described below.
Primary filegroup
This filegroup contains the primary data file and any other files not placed into another
filegroup. All pages for the system tables are allocated from the primary filegroup.
User-defined filegroup
This filegroup is any filegroup specified using the FILEGROUP keyword in a CREATE
DATABASE or ALTER DATABASE statement, or on the Properties dialog box within SQL
Server Enterprise Manager.
Default filegroup
The default filegroup contains the pages for all tables and indexes that do not have a
filegroup specified when they are created. In each database, only one filegroup at a
time can be the default filegroup. If no default filegroup is specified, the default is the
primary filegroup.
Files and filegroups are useful for controlling the placement of data and indexes and to
eliminate device contention. Quite a few installations also leverage files and filegroups
as a mechanism that is more granular than a database in order to exercise more control
over their database backup/recovery strategy.
Horizontal Partitioning (Table)
Horizontal partitioning segments a table into multiple tables, each containing the same
number of columns but fewer rows. Determining how to partition the tables horizontally
depends on how data is analyzed. A general rule of thumb is to partition tables so
queries reference as few tables as possible. Otherwise, excessive UNION queries, used
to merge the tables logically at query time, can impair performance.
When you partition data across multiple tables or multiple servers, queries accessing
only a fraction of the data can run faster because there is less data to scan. If the
tables are located on different servers, or on a computer with multiple processors, each
table involved in the query can also be scanned in parallel, thereby improving query
performance. Additionally, maintenance tasks, such as rebuilding indexes or backing up
a table, can execute more quickly.
By using a partitioned view, the data still appears as a single table and can be queried
as such without having to reference the correct underlying table manually.
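A simplified sketch of a local partitioned view follows; the table and column names are illustrative. The CHECK constraint on the partitioning column is what allows the optimizer to skip member tables that cannot satisfy a query.

CREATE TABLE sales_1999 (
  sale_date DATETIME NOT NULL CHECK (sale_date >= '19990101' AND sale_date < '20000101'),
  amount    MONEY    NOT NULL
)
GO
CREATE TABLE sales_2000 (
  sale_date DATETIME NOT NULL CHECK (sale_date >= '20000101' AND sale_date < '20010101'),
  amount    MONEY    NOT NULL
)
GO
CREATE VIEW sales AS
SELECT sale_date, amount FROM sales_1999
UNION ALL
SELECT sale_date, amount FROM sales_2000
GO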
Cost Threshold for Parallelism Option
Use this option to specify the threshold where SQL Server creates and executes parallel
plans. SQL Server creates and executes a parallel plan for a query only when the
estimated cost to execute a serial plan for the same query is higher than the value set
in cost threshold for parallelism. The cost refers to an estimated elapsed time in
seconds required to execute the serial plan on a specific hardware configuration. Only
set cost threshold for parallelism on symmetric multiprocessors (SMP).
Max Degree of Parallelism Option
Use this option to limit the number of processors (a max of 32) to use in parallel plan
execution. The default value is 0, which uses the actual number of available CPUs. Set
this option to 1 to suppress parallel plan generation. Set the value to a number greater
than 1 to restrict the maximum number of processors used by a single query execution.
Priority Boost Option
Use this option to specify whether SQL Server should run at a higher scheduling priority
than other processes on the same computer. If you set this option to 1, SQL Server
runs at a priority base of 13. The default is 0, which is a priority base of 7.
Set Working Set Size Option
Use this option to reserve physical memory space for SQL Server that is equal to the
server memory setting. The server memory setting is configured automatically by SQL
Server based on workload and available resources. It will vary dynamically between
min server memory and max server memory. Setting set working set size means the
operating system will not attempt to swap out SQL Server pages even if they can be
used more readily by another process when SQL Server is idle.
Optimizing Disk I/O Performance
When configuring a SQL Server that will contain only a few gigabytes of data and not
sustain heavy read or write activity, you need not be particularly concerned with the
subject of disk I/O and balancing of SQL Server I/O activity across hard drives for
maximum performance. To build larger SQL Server databases however, which will
contain hundreds of gigabytes or even terabytes of data and/or that can sustain heavy
read/write activity (as in a DSS application), it is necessary to drive configuration
around maximizing SQL Server disk I/O performance by load-balancing across multiple
hard drives.
Partitioning for Performance
For SQL Server databases that are stored on multiple disk drives, performance can be
improved by partitioning the data to increase the amount of disk I/O parallelism.
Partitioning can be done using a variety of techniques. Methods for creating and
managing partitions include configuring your storage subsystem (i.e., disk, RAID
partitioning) and applying various data configuration mechanisms in SQL Server such
as files, file groups, tables and views. Some possible candidates for partitioning include:
Transaction log
Tempdb
Database
Tables
Non-clustered indexes
Using bcp and BULK INSERT
Two mechanisms exist inside SQL Server to address the need for bulk movement of
data. The first mechanism is the bcp utility. The second is the BULK INSERT statement.
Bcp is a command prompt utility that copies data into or out of SQL Server.
BULK INSERT is a Transact-SQL statement that can be executed from within the
database environment. Unlike bcp, BULK INSERT can only pull data into SQL
Server. An advantage of using BULK INSERT is that it can copy data into
instances of SQL Server using a Transact-SQL statement, rather than having to
shell out to the command prompt.
TIP
Both of these mechanisms enable you to exercise control over the batch size.
Unless you are working with small volumes of data, it is good to get in the
habit of specifying a batch size for recoverability reasons. If none is specified,
SQL Server commits all rows to be loaded as a single batch. For example, you
attempt to load 1,000,000 rows of new data into a table. The server suddenly
loses power just as it finishes processing row number 999,999. When the
server recovers, those 999,999 rows will need to be rolled back out of the
database before you attempt to reload the data. By specifying a batch size of
10,000, you could have saved significant recovery time, because SQL Server
would only have had to roll back 9,999 rows instead of 999,999. A sketch of a
BULK INSERT statement specifying a batch size appears below.
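The following statement is illustrative only; the database, table, file path, and delimiters are assumptions to be replaced with your own.

BULK INSERT SalesDW.dbo.customer_stage
FROM 'D:\loads\customer.dat'
WITH (
   FIELDTERMINATOR = '|',
   ROWTERMINATOR   = '\n',
   BATCHSIZE       = 10000,
   TABLOCK
)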

General Guidelines for Initial Data Loads
While loading data:
Remove indexes
Use Bulk INSERT or bcp
Parallel load using partitioned data files into partitioned tables
Run one load stream for each available CPU
Set Bulk-Logged or Simple Recovery model
Use TABLOCK option
After loading data:
Create indexes
Switch to the appropriate recovery model
Perform backups
General Guidelines for Incremental Data Loads
Load Data with indexes in place
Performance and concurrency requirements should determine locking granularity
(sp_indexoption).
Change from Full to Bulk-Logged Recovery mode unless there is an overriding need to
preserve point-in-time recovery, such as online users modifying the database during
bulk loads. Read operations should not affect bulk loads.


Performance Tuning Databases (Teradata)
Challenge
Database tuning can result in tremendous improvement in loading performance. This
Best Practice covers tips on tuning Teradata.
Description
Teradata offers several bulk load utilities including FastLoad, MultiLoad, and TPump.
FastLoad is used for loading inserts into an empty table. One of TPump's advantages is
that it does not lock the table that is being loaded. MultiLoad supports inserts, updates,
deletes, and upserts to any table. This best practice will focus on MultiLoad since
PowerCenter 5.x can auto-generate MultiLoad scripts and invoke the MultiLoad utility
per PowerCenter target.
Tuning MultiLoad
There are many aspects to tuning a Teradata database. With PowerCenter 5.x several
aspects of tuning can be controlled by setting MultiLoad parameters to maximize write
throughput. Other areas to analyze when performing a MultiLoad job include estimating
space requirements and monitoring MultiLoad performance.
Note: In PowerCenter 5.1, the Informatica server transfers data via a UNIX named pipe
to MultiLoad, whereas in PowerCenter 5.0, the data is first written to file.
MultiLoad parameters
With PowerCenter 5.x, you can auto-generate MultiLoad scripts. This not only enhances
development, but also allows you to set performance options. Here are the MultiLoad-
specific parameters that are available in PowerCenter:
TDPID. A client based operand that is part of the logon string.
Date Format. Ensure that the date format used in your target flat file is
equivalent to the date format parameter in your MultiLoad script. Also validate
that your date format is compatible with the date format specified in the
Teradata database.
Checkpoint. A checkpoint interval is similar to a commit interval for other
databases. When you set the checkpoint value to less than 60, it represents the
interval in minutes between checkpoint operations. If the checkpoint is set to a
value greater than 60, it represents the number of records to write before
performing a checkpoint operation. To maximize write speed to the database,
try to limit the number of checkpoint operations that are performed.
Tenacity. Interval in hours between MultiLoad attempts to log on to the database
when the maximum number of sessions are already running.
Load Mode. Available load methods include Insert, Update, Delete, and Upsert.
Consider creating separate external loader connections for each method,
selecting the one that will be most efficient for each target table.
Drop Error Tables. Allows you to specify whether to drop or retain the three error
tables for a MultiLoad session. Set this parameter to 1 to drop error tables or 0
to retain error tables.
Max Sessions. Available only in PowerCenter 5.1, this parameter specifies the
maximum number of sessions that are allowed to log on to the database. This
value should not exceed one per working amp (Access Module Process).
Sleep. Available only in PowerCenter 5.1, this parameter specifies the number of
minutes that MultiLoad waits before retrying a logon operation.

Estimating Space Requirements for MultiLoad Jobs
Always estimate the final size of your MultiLoad target tables and make sure the
destination has enough space to complete your MultiLoad job. In addition to the space
that may be required by target tables, each MultiLoad job needs permanent space for:
Work tables
Error tables
Restart Log table
Note: Spool space cannot be used for MultiLoad work tables, error tables, or the
restart log table. Spool space is freed at each restart. By using permanent space for the
MultiLoad tables, data is preserved for restart operations after a system failure. Work
tables, in particular, require a lot of extra permanent space. Also remember to account
for the size of error tables since error tables are generated for each target table.
Use the following formula to prepare the preliminary space estimate for one target
table, assuming no fallback protection, no journals, and no non-unique secondary
indexes:
PERM = (using data size + 38) x (number of rows processed) x (number of apply
conditions satisfied) x (number of Teradata SQL statements within the applied DML)
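For illustration only (the input numbers are assumed): with a using data size of 212 bytes, 1,000,000 rows processed, one apply condition satisfied, and one Teradata SQL statement within the applied DML, the preliminary estimate would be PERM = (212 + 38) x 1,000,000 x 1 x 1 = 250,000,000 bytes, or roughly 238 MB of permanent space for that one target table.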
Make adjustments to your preliminary space estimates according to the requirements
and expectations of your MultiLoad job.
Monitoring MultiLoad Performance
Here are some tips for analyzing MultiLoad performance:
1. Determine which phase of the MultiLoad job is causing poor performance.
If the performance bottleneck is during the acquisition phase, as data is acquired
from the client system, then the issue may be with the client system. If it is
during the application phase, as data is applied to the target tables, then the
issue is not likely to be with the client system.
The MultiLoad job output lists the job phases and other useful information. Save
these listings for evaluation.
2. Use the Teradata RDBMS Query Session utility to monitor the progress of the
MultiLoad job.
3. Check for locks on the MultiLoad target tables and error tables.
4. Check the DBC.Resusage table for problem areas, such as data bus or CPU
capacities at or near 100 percent for one or more processors.
5. Determine whether the target tables have non-unique secondary indexes
(NUSIs). NUSIs degrade MultiLoad performance because the utility builds a
separate NUSI change row to be applied to each NUSI sub-table after all of the
rows have been applied to the primary table.
6. Check the size of the error tables. Write operations to the fallback error tables
are performed at normal SQL speed, which is much slower than normal
MultiLoad tasks.
7. Verify that the primary index is unique. Non-unique primary indexes can cause
severe MultiLoad performance problems.


Performance Tuning UNIX Systems
Challenge
Identify opportunities for performance improvement within the complexities of the UNIX
operating environment.
Description
This section provides an overview of the subject area, followed by discussion of detailed
usage of specific tools.
Overview
All system performance issues are basically resource contention issues. In any
computer system, there are four fundamental resources: CPU, memory, disk I/O, and
network I/O. From this standpoint, performance tuning for PowerCenter means ensuring
that the PowerCenter Server and its subprocesses get adequate resources to execute
in a timely and efficient manner.
Each resource has its own particular set of problems. Resource problems are
complicated because all resources interact with one another. Performance tuning is
about identifying bottlenecks and making trade-offs to improve the situation. Your best
approach is to initially take a baseline measurement and produce a
characterization of the system to provide a good understanding of how it behaves, then
evaluate the bottlenecks shown on each system resource during your load window and
determine which resource contention, once removed, offers the greatest opportunity for
performance enhancement.
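A minimal baseline-capture sketch is shown below; the sampling interval, duration, and output directory are arbitrary, and not every command (or option) is available on every UNIX flavor.

#!/bin/sh
# Capture a system baseline during the load window: one sample per minute for an hour.
OUTDIR=/tmp/perf_baseline
mkdir -p $OUTDIR

vmstat 60 60  > $OUTDIR/vmstat.out  2>&1 &   # CPU, run queue, paging
iostat 60 60  > $OUTDIR/iostat.out  2>&1 &   # per-disk transfer rates
sar -u 60 60  > $OUTDIR/sar_cpu.out 2>&1 &   # %usr, %sys, %wio, %idle
netstat -i    > $OUTDIR/netstat.out 2>&1     # interface errors and collisions
uptime        > $OUTDIR/uptime.out  2>&1     # load average snapshot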
Here is a summary of each system resource area and the problems it can have.
CPU

On any multiprocessing and multiuser system many processes want to use the
CPUs at the same time. The UNIX kernel is responsible for allocation of a finite
number of CPU cycles across all running processes. If the total demand on the
CPU exceeds its finite capacity, then all processing will reflect a negative impact
on performance; the system scheduler will put each process in a queue to wait
for CPU availability.
An average of the count of active processes in the system for the last 1, 5, and 15
minutes is reported as load average when you execute the command uptime.
The load average provides a basic indicator of the number of contenders for
CPU time. Likewise, the vmstat command provides the average usage of all the CPUs
along with the number of processes contending for CPU (the value under the r
column).
On SMP (symmetric multiprocessing) servers, watch for even
utilization of all the CPUs. How well all the CPUs are utilized depends on how
well an application can be parallelized. If a process is incurring a high degree of
involuntary context switches by the kernel, binding the process to a
specific CPU might improve performance.
Memory

Memory contention arises when the memory requirements of the active processes
exceed the physical memory available on the system; at this point, the system
is out of memory. To handle this lack of memory the system starts paging, or
moving portions of active processes to disk in order to reclaim physical memory.
At this point, performance decreases dramatically. Paging is distinguished from
swapping, which means moving entire processes to disk and reclaiming their
space. Paging and excessive swapping indicate that the system can't provide
enough memory for the processes that are currently running.
Commands such as vmstat and pstat show whether the system is paging; ps,
prstat and sar can report the memory requirements of each process.
Disk IO

The I/O subsystem is a common source of resource contention problems. A finite
amount of I/O bandwidth must be shared by all the programs (including the
UNIX kernel) that currently run. The system's I/O buses can transfer only so
many megabytes per second; individual devices are even more limited. Each
kind of device has its own peculiarities and, therefore, its own problems.
There are tools for evaluating specific parts of the subsystem:
o iostat can give you information about the transfer rates for each disk drive
o ps and vmstat can give some information about how many processes are
blocked waiting for I/O
o sar can provide voluminous information about I/O efficiency
o sadp can give detailed information about disk access patterns
Network IO

It is very likely the source data, the target data or both are connected through an
Ethernet channel to the system where PowerCenter is residing. Take into
consideration the number of Ethernet channels and bandwidth available to avoid
congestion.
o netstat shows packet activity on a network; watch for a high collision rate on
output packets on each interface.
o nfsstat monitors NFS traffic; execute nfsstat -c from a client machine (not
from the NFS server); watch for a high timeout rate relative to total calls and for
server not responding messages.
Given that these issues all boil down to access to some computing resource, mitigation
of each issue consists of making some adjustment to the environment to provide more
(or preferential) access to the resource; for instance:
Adjust execution schedules to allow leverage of low usage times may improve
availability of memory, disk, network bandwidth, CPU cycles, etc.
Migrating other applications to other hardware will reduce demand on the
hardware hosting PowerCenter
For CPU intensive sessions, raising CPU priority (or lowering priority for competing
processes) provides more CPU time to the PowerCenter sessions
Adding hardware resource, such as adding more memory, will make more resource
available to all processes
Re-configuring existing resources may provide for more efficient usage, such as
assigning different disk devices for input and output, striping disk devices, or
adjusting network packet sizes

Detailed Usage

The following tips have proven useful in performance tuning UNIX-based machines.
While some of these tips will be more helpful than others in a particular environment,
all are worthy of consideration.
Availability, syntax and format of each will vary across UNIX versions.

Running ps -axu
Run ps -axu to check for the following items:
Are there any processes waiting for disk access or for paging? If so check the I/O
and memory subsystems.
What processes are using most of the CPU? This may help you distribute the
workload better.
What processes are using most of the memory? This may help you distribute the
workload better.
Does ps show that your system is running many memory-intensive jobs? Look for
jobs with a large set (RSS) or a high storage integral.
Identifying and Resolving Memory Issues
Use vmstat or sar to check for paging/swapping actions. Check the system to
ensure that excessive paging/swapping does not occur at any time during the session
processing. By using sar 5 10 or vmstat 1 10, you can get a snapshot of
paging/swapping. If paging or excessive swapping does occur at any time, increase
memory to prevent it. Paging/swapping, on any database system, causes a major
performance decrease and increased I/O. On a memory-starved and I/O-bound server,
this can effectively shut down the PowerCenter process and any databases running on
the server.
Some swapping may occur normally regardless of the tuning settings. This occurs
because some processes use the swap space by their design. To check swap space
availability, use pstat and swap. If the swap space is too small for the intended
applications, it should be increased.
Run vmstat 5 (sar -wpgr on SunOS) or vmstat -S 5 to detect and confirm memory
problems, and check for the following:
Are page-outs occurring consistently? If so, you are short of memory.
Are there a high number of address translation faults? (System V only) This
suggests a memory shortage.
Are swap-outs occurring consistently? If so, you are extremely short of memory.
Occasional swap-outs are normal; BSD systems swap out inactive jobs. Long
bursts of swap-outs mean that active jobs are probably falling victim and
indicate extreme memory shortage. If you don't have vmstat -S, look at the w
and de fields of vmstat. These should ALWAYS be zero.
If memory seems to be the bottleneck of the system, try following remedial steps:
Reduce the size of the buffer cache, if your system has one, by decreasing
BUFPAGES.
If you have statically allocated STREAMS buffers, reduce the number of large
(2048- and 4096-byte) buffers. This may reduce network performance, but
netstat -m should give you an idea of how many buffers you really need.
Reduce the size of your kernel's tables. This may limit the system's capacity
(number of files, number of processes, etc.).
Try running jobs requiring a lot of memory at night. This may not help the memory
problems, but you may not care about them as much.
Try running jobs requiring a lot of memory in a batch queue. If only one memory-
intensive job is running at a time, your system may perform satisfactorily.
Try to limit the time spent running sendmail, which is a memory hog.
If you don't see any significant improvement, add more memory.

Identifying and Resolving Disk I/O Issues
Use iostat to check I/O load and utilization, as well as CPU load. iostat can be used
to monitor the I/O load on the disks on the UNIX server. Using iostat permits
monitoring the load on specific disks. Take notice of how fairly disk activity is
distributed among the system disks. If it is not, are the most active disks also the
fastest disks?
Run sadp to get a seek histogram of disk activity. Is activity concentrated in one
area of the disk (good), spread evenly across the disk (tolerable), or in two well-defined
peaks at opposite ends (bad)?
Reorganize your file systems and disks to distribute I/O activity as evenly as
possible.
Using symbolic links helps to keep the directory structure the same throughout
while still moving the data files that are causing I/O contention.
Use your fastest disk drive and controller for your root file system; this will almost
certainly have the heaviest activity. Alternatively, if single-file throughput is
important, put performance-critical files into one file system and use the fastest
drive for that file system.
Put performance-critical files on a file system with a large block size: 16KB or
32KB (BSD).
Increase the size of the buffer cache by increasing BUFPAGES (BSD). This may
hurt your system's memory performance.
Rebuild your file systems periodically to eliminate fragmentation (backup, build a
new file system, and restore).
If you are using NFS and using remote files, look at your network situation. You
don't have local disk I/O problems.
Check memory statistics again by running vmstat 5 (sar -rwpg). If your system
is paging or swapping consistently, you have memory problems; fix the memory
problem first. Swapping makes performance worse.
If your system has a disk capacity problem and is constantly running out of disk
space, try the following actions:
Write a find script that detects old core dumps, editor backup and auto-save files,
and other trash and deletes it automatically. Run the script through cron.
Use the disk quota system (if your system has one) to prevent individual users
from gathering too much storage.
Use a smaller block size on file systems that are mostly small files (e.g., source
code files, object modules, and small data files).

Identifying and Resolving CPU Overload Issues
Use uptime or sar -u to check for CPU loading. sar provides more detail, including
%usr (user), %sys (system), %wio (waiting on I/O), and %idle (% of idle time). A
target goal should be %usr + %sys = 80 and %wio = 10, leaving %idle at 10. If %wio
is higher, disk and I/O contention should be investigated to eliminate the I/O bottleneck
on the UNIX server. If the system shows a heavy %sys load together with a high
%idle, this is indicative of memory contention and swapping/paging problems. In this
case, it is necessary to make memory changes to reduce the load on the server.
When you run iostat 5 above, also observe for CPU idle time. Is the idle time always
0, without letup? It is good for the CPU to be busy, but if it is always busy 100 percent
of the time, work must be piling up somewhere. This points to CPU overload.
Eliminate unnecessary daemon processes. rwhod and routed are particularly
likely to be performance problems, but any savings will help.
Get users to run jobs at night with at or any queuing system that's available.
You may not care if the CPU (or the memory or I/O system) is
overloaded at night, provided the work is done in the morning.
Using nice to lower the priority of CPU-bound jobs will improve interactive
performance. Also, using nice to raise the priority of CPU-bound jobs will
expedite them but will hurt interactive performance. In general though, using
nice is really only a temporary solution. If your workload grows, it will soon
become insufficient. Consider upgrading your system, replacing it, or buying
another system to share the load.

Identifying and Resolving Network IO Issues
You can suspect problems with network capacity or with data integrity if users
experience slow performance when they are using rlogin or when they are accessing
files via NFS.
Look at netstat -i. If the number of collisions is large, suspect an overloaded network.
If the number of input or output errors is large, suspect hardware problems. A large
number of input errors indicate problems somewhere on the network. A large number
of output errors suggests problems with your system and its interface to the network.
If collisions and network hardware are not a problem, figure out which system
appears to be slow. Use spray to send a large burst of packets to the slow system. If
the number of dropped packets is large, the remote system most likely cannot respond
to incoming data fast enough. Look to see if there are CPU, memory or disk I/O
problems on the remote system. If not, the system may just not be able to tolerate
heavy network workloads. Try to reorganize the network so that this system isn't a file
server.
A large number of dropped packets may also indicate data corruption. Run netstat -s
on the remote system, then spray the remote system from the local system and run
netstat -s again. If the increase of UDP socket full drops (as indicated by netstat) is
equal to or greater than the number of dropped packets that spray reports, the remote
system is a slow network server. If the increase of socket full drops is less than the
number of dropped packets, look for network errors.
Run nfsstat and look at the client RPC data. If the retrans field is more than 5 percent
of calls, the network or an NFS server is overloaded. If timeout is high, at least one NFS
server is overloaded, the network may be faulty, or one or more servers may have
crashed. If badxid is roughly equal to timeout, at least one NFS server is overloaded. If
timeout and retrans are high, but badxid is low, some part of the network between the
NFS client and server is overloaded and dropping packets.
Try to prevent users from running I/O-intensive programs across the
network. The grep utility is a good example of an I/O-intensive program. Instead, have
users log into the remote system to do their work.
Reorganize the computers and disks on your network so that as many users as
possible can do as much work as possible on a local system.
Use systems with good network performance as file servers.
lsattr -E -l sys0 can be used to determine some current settings in some UNIX
environments. In Solaris, you execute prtenv. Of particular interest is maxuproc,
the setting that determines the maximum number of user background processes.
On most UNIX environments, this defaults to 40 but should be increased to 250 on
most systems.
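On AIX, for example, the setting can be checked and raised with lsattr and chdev; the commands below are a sketch for that platform only and require root privileges, so consult your own platform documentation for the equivalent procedure.

# Display the current maxuproc value (AIX)
lsattr -E -l sys0 -a maxuproc

# Raise it to 250
chdev -l sys0 -a maxuproc=250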
Choose a File System. Be sure to check the database vendor documentation to
determine the best file system for the specific machine. Typical choices include: s5, the
UNIX System V file system; ufs, the UNIX file system derived from Berkeley (BSD);
vxfs, the Veritas file system; and lastly, raw devices, which in reality are not a file
system at all.
Use the pmprocs utility (a PowerCenter utility) to view the current Informatica
processes. For example:
harmon 125: pmprocs

<------------ Current PowerMart processes --------------->

UID PID PPID C STIME TTY TIME CMD
powermar 2711 1421 16 18:13:11 ? 0:07 dtm pmserver.cfg 0 202 -289406976
powermar 2713 2711 11 18:13:17 ? 0:05 dtm pmserver.cfg 0 202 -289406976
powermar 1421 1 1 08:39:19 ? 1:30 pmserver
powermar 2712 2711 17 18:13:17 ? 0:08 dtm pmserver.cfg 0 202 -289406976
powermar 2714 1421 11 18:13:20 ? 0:04 dtm pmserver.cfg 1 202 -289406976
powermar 2721 2714 12 18:13:27 ? 0:04 dtm pmserver.cfg 1 202 -289406976
powermar 2722 2714 8 18:13:27 ? 0:02 dtm pmserver.cfg 1 202 -289406976

<------------ Current Shared Memory Resources --------------->

IPC status from <running system> as of Tue Feb 16 18:13:55 1999
T ID KEY MODE OWNER GROUP SEGSZ CPID LPID
Shared Memory:
m 0 0x094e64a5 --rw-rw---- oracle dba 20979712 1254 1273
m 1 0x0927e9b2 --rw-rw---- oradba dba 21749760 1331 2478
m 202 00000000 --rw------- powermar pm4 5000000 1421 2714
m 8003 00000000 --rw------- powermar pm4 25000000 2711 2711
m 4 00000000 --rw------- powermar pm4 25000000 2714 2714

<------------ Current Semaphore Resources --------------->

There are 19 Semaphores held by PowerMart processes
A few points about the pmprocs utility:
pmprocs is a script that combines the ps and ipcs commands
It is available only on UNIX
CPID - Creator PID
LPID - Last PID that accessed the resource
Semaphores - used to synchronize the reader and writer
0 or 1 - shows the slot in LM shared memory


Performance Tuning Windows NT/2000 Systems
Challenge
The Microsoft Windows NT/2000 environment is easier to tune than UNIX environments
but offers limited performance options. NT is considered a self-tuning operating
system because it attempts to configure and tune memory to the best of its ability.
However, this does not mean that the NT System Administrator is entirely free from
performance improvement responsibilities.
Note: Tuning is essentially the same for both NT and 2000 based systems, with
differences for Windows 2000 noted in the last section.
Description
The following tips have proven useful in performance tuning NT-based machines. While
some are likely to be more helpful than others in any particular environment, all are
worthy of consideration.
The two places to begin tuning an NT server are:
Performance Monitor.
Performance tab (hit ctrl+alt+del, choose task manager, and click on the
Performance tab).
Although the Performance Monitor can be tracked in real-time, creating a result-set
representative of a full day is more likely to render an accurate view of system
performance.
Resolving Typical NT Problems
The following paragraphs describe some common performance problems in an NT
environment and suggest tuning solutions.
Load reasonableness. Assume that some software will not be well coded, and some
background processes (e.g., a mail server or web server) running on a single machine,
can potentially starve the machine's CPUs. In this situation, off-loading the CPU hogs
may be the only recourse.

Device Drivers. The device drivers for some types of hardware are notorious for
inefficient CPU clock cycles. Be sure to obtain the latest drivers from the hardware
vendor to minimize this problem.
Memory and services. Although adding memory to NT is always a good solution, it is
also expensive and usually must be planned around the memory bank configuration of
EISA and PCI architectures. Before adding memory, check the Services in Control Panel,
because many background applications do not uninstall the old service when installing
a new version. Thus, both the unused old service and the new service may be consuming
valuable CPU and memory resources.
I/O Optimization. This is, by far, the best tuning option for database applications in
the NT environment. If necessary, level the load across the disk devices by moving
files. In situations where there are multiple controllers, be sure to level the load across
the controllers too.
Using electrostatic devices and fast-wide SCSI can also help to increase performance.
Further, fragmentation can usually be eliminated by using a Windows NT/2000 disk
defragmentation product, regardless of whether the disk is formatted for FAT or NTFS.
Finally, on NT servers, be sure to implement disk striping to split single data files
across multiple disk drives and take advantage of RAID (Redundant Arrays of
Inexpensive Disks) technology. Also increase the priority of the disk devices on the NT
server. NT, by default, sets the disk device priority low. Change the disk priority setting
in the Registry under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters
by adding a value named ThreadPriority of type DWORD with a value of 2.
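As a sketch, the change can also be scripted with reg.exe (available in the NT/2000 Resource Kit or Support Tools; back up the registry and confirm the key path on your own system before applying it):

reg add "HKLM\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters" /v ThreadPriority /t REG_DWORD /d 2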
Monitoring System Performance in Windows 2000
In Windows 2000, the PowerCenter Server uses system resources to process
transformation, session execution, and reading and writing of data. The PowerCenter
Server also uses system memory for other data such as aggregate, joiner, rank, and
cached lookup tables. With Windows 2000, you can use the system monitor in the
Performance Console of the administrative tools, or system tools in the task manager,
to monitor the amount of system resources used by the PowerCenter Server and to
identify system bottlenecks.
Windows 2000 provides the following tools (accessible under the Control
Panel/Administration Tools/Performance) for monitoring resource usage on your
computer:
System Monitor
Performance Logs and Alerts
These Windows 2000 monitoring tools enable you to analyze usage and detect
bottlenecks at the disk, memory, processor, and network level.
System Monitor
The System Monitor displays a graph which is flexible and configurable. You can copy
counter paths and settings from the System Monitor display to the Clipboard and paste

counter paths from Web pages or other sources into the System Monitor display.
Because the System Monitor is portable, it is useful in monitoring other systems that
require administration.
Note: Typing perfmon.exe at the command prompt causes the system to start System
Monitor, not Performance Monitor.
Performance Logs and Alerts
The Performance Logs and Alerts tool provides two types of performance-related logs
(counter logs and trace logs) and an alerting function.
Counter logs record sampled data about hardware resources and system services
based on performance objects and counters in the same manner as System Monitor.
They can, therefore, be viewed in System Monitor. Data in counter logs can be saved as
comma-separated or tab-separated files that are easily viewed with Excel. Trace logs
collect event traces that measure performance statistics associated with events such as
disk and file I/O, page faults, or thread activity. The alerting function allows you to
define a counter value that will trigger actions such as sending a network message,
running a program, or starting a log. Alerts are useful if you are not actively monitoring
a particular counter threshold value, but want to be notified when it exceeds or falls
below a specified value so that you can investigate and determine the cause of the
change. You may want to set alerts based on established performance baseline values
for your system.
Note: You must have Full Control access to a subkey in the registry in order to create or
modify a log configuration. (The subkey is
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\SysmonLog\Log Queries.)
The predefined log settings under Counter Logs (i.e., System Overview) are configured
to create a binary log that, after manual start-up, updates every 15 seconds and logs
continuously until it reaches a maximum size. If you start logging with the default
settings, data is saved to the Perflogs folder on the root directory and includes the
counters Memory\Pages/sec, PhysicalDisk(_Total)\Avg. Disk Queue Length, and
Processor(_Total)\% Processor Time.
If you want to create your own log setting, right-click one of the log types.


Platform Sizing
Challenge
Determining the appropriate platform size to support the PowerCenter environment
based on customer environments and requirements.
Description

The required platform size to support PowerCenter depends on each customer's unique
environment and processing requirements. The PowerCenter engine allocates resources
for individual extraction, transformation, and load (ETL) jobs or sessions. Each session
has its own resource requirements. The resources required for the PowerCenter engine
depend on the number of sessions, what each session does while moving data, and how
many sessions run concurrently. This Best Practice outlines the questions pertinent to
estimating the platform requirements.
TIP
An important concept regarding platform sizing is not to size your
environment too soon in the project lifecycle. Too often, clients size their
machines before any ETL is designed or developed, and in many cases these
platforms are too small for the resultant system. Thus, it is better to analyze
sizing requirements after the data transformation processes have been well
defined during the design and development phases.

Environment Questions
When considering a platform size, you should consider the following questions
regarding your environment:
What sources do you plan to access?
How do you currently access those sources?
Have you decided on the target environment (database/hardware/operating
system)? If so, what is it?
Have you decided on the PowerCenter server environment (hardware/operating
system)?
Is it possible for the PowerCenter server to be on the same machine as the target?
How do you plan to access your information (cube, ad-hoc query tool) and what
tools will you use to do this?
What other applications or services, if any, run on the PowerCenter server?

What are the latency requirements for the PowerCenter loads?

Engine Sizing Questions
When considering the engine size, you should consider the following questions:
Is the overall ETL task currently being done? If so, how do you do it, and how long
does it take?
What is the total volume of data to move?
What is the largest table (bytes and rows)? Is there any key on this table that
could be used to partition load sessions, if needed?
How often will the refresh occur?
Will refresh be scheduled at a certain time, or driven by external events?
Is there a "modified" timestamp on the source table rows?
What is the batch window available for the load?
Are you doing a load of detail data, aggregations, or both?
If you are doing aggregations, what is the ratio of source-to-target rows for the
largest result set? How large is the result set (bytes and rows)?
The answers to these questions offer an approximate guide to the factors that affect
PowerCenter's resource requirements. To simplify the analysis, you can focus on large
jobs that drive the resource requirement.
Engine Resource Consumption
The following sections summarize some recommendations on PowerCenter engine
resource consumption.
Processor
1-1.5 CPUs per concurrent non-partitioned session or transformation job.
Memory

20 to 30MB of memory for the main engine for session coordination.
20 to 30MB of memory per session, if there are no aggregations, lookups, or
heterogeneous data joins. Note that 32-bit systems have an operating system
limitation of 3GB per session.
Caches for aggregation, lookups or joins use additional memory:
Lookup tables are cached in full; the memory consumed depends on the size of the
tables.
Aggregate caches store the individual groups; more memory is used if there are
more groups.
Sorting the input to aggregations greatly reduces the need for memory.
Joins cache the master table in a join; memory consumed depends on the size of
the master.
Disk space

Disk space is not a factor if the machine is used only as the PowerCenter engine, unless
you have the following conditions:
Data is staged to flat files on the PowerCenter server machine.
Data is stored in incremental aggregation files for adding data to aggregates. The
space consumed is about the size of the data aggregated.
Temporary space is needed for paging for transformations that require large
caches that cannot be held entirely in system memory.
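Putting the numbers above together, a purely illustrative estimate (the session counts, lookup size, and row width below are hypothetical) for a nightly load of four concurrent, non-partitioned sessions might look like this:

Processors:  4 sessions x 1 to 1.5 CPUs          = 4 to 6 CPUs
Memory:      engine coordination                 = 20 to 30 MB
             4 sessions x 20 to 30 MB            = 80 to 120 MB
             one lookup cache of 2,000,000 rows
             at roughly 250 bytes per row        = ~500 MB
             estimated total                     = ~650 MB, plus headroom for the OS and database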
Sizing analysis
The basic goal is to size the machine so that all jobs can complete within the specified
load window. You should consider the answers to the questions in the "Environment"
and "Engine Sizing" sections to estimate the required number of sessions, the volume
of data that each session moves, and its lookup table, aggregation, and heterogeneous
join caching requirements. Use these estimates with the recommendations in the
"Engine Resource Consumption" section to determine the required number of
processors, memory, and disk space to achieve the required performance to meet the
load window.
Note that the deployment environment often creates performance constraints that
hardware capacity cannot overcome. The engine throughput is usually constrained by
one or more of the environmental factors addressed by the questions in the
"Environment" section. For example, if the data sources and target are both remote
from the PowerCenter server, the network is often the constraining factor. At some
point, additional sessions, processors, and memory might not yield faster execution
because the network (not the PowerCenter server) imposes the performance limit. The
hardware sizing analysis is highly dependent on the environment in which the server is
deployed. You need to understand the performance characteristics of the environment
before making any sizing conclusions.
It is also vitally important to remember that it is likely that other applications in
addition to PowerCenter may use the platform. It is very common for PowerCenter to
run on a server with a database engine and query/analysis tools. In fact, in an
environment where PowerCenter, the target database, and query/analysis tools all run
on the same machine, the query/analysis tool often drives the hardware requirements.
However, if the loading is performed after business hours, the query/analysis tools
requirements may not be a sizing limitation.



Recommended Performance Tuning Procedures
Challenge
Sometimes it is necessary to employ a series of performance tuning procedures in order
to optimize PowerCenter load times.
Description
When a PowerCenter session or workflow is not performing at the expected or desired
speed, there is a methodology that can be followed to help diagnose problems that
might be adversely affecting any component of the data integration architecture. While
PowerCenter has its own performance settings that can be tuned, the entire data
integration architecture, including the UNIX/Windows servers, network, disk array, and
the source and target databases, must also be considered. More often than not, the
cause of the performance problem is an issue external to PowerCenter. In order to
correctly and scientifically determine the most likely cause of the performance
problem, it is necessary to execute the performance tuning steps in a specific order.
This allows you to methodically rule out individual pieces and narrow down the
specific areas on which to focus your tuning efforts.
1. Perform Benchmarking
You should always have a baseline of your current load times for a given workflow or
session with a similar record count. Maybe you are not achieving your required load
window or simply think your processes could run more efficiently based on other similar
tasks currently running faster than the problem process. Use this benchmark to
estimate what your desired performance goal should be and tune to this goal. Start
with the problem mapping you have created along with a session and workflow that
uses all default settings. This allows you to see, systematically, exactly which of the
changes you make have a positive impact on performance.

2. Identify The Performance Bottleneck Area
This step will help greatly in narrowing down the areas in which to begin focusing.
There are five areas to focus on when performing the bottleneck diagnosis. The areas
in order of focus are:
Target

Source
Mapping
Session/Workflow
System.
The methodology steps you through a series of proven tests using PowerCenter to
identify trends that point to where you should focus your time next. Remember to go
through these tests in a scientific manner, running them multiple times before drawing
a conclusion, and realize that identifying and fixing one bottleneck may create a
different one. For more information, see Determining Bottlenecks.
3. Optimize "Inside" or "Outside" PowerCenter
Depending on the results of the bottleneck tests, optimize inside or outside
PowerCenter. Be sure to perform the bottleneck test in the order prescribed in
Determining Bottlenecks, since this is also the order in which you will make any
performance changes.
Problems outside PowerCenter refer to anything you find indicating that the source
of the performance problem lies outside the PowerCenter mapping design or
workflow/session settings. This usually means a source/target database problem, a
network bottleneck, or a server operating system problem. These are the most common
performance problems.
For source database related bottlenecks, refer to Tuning SQL Overrides and
Environment for Better Performance.
For target database related problems, refer to Performance Tuning Databases
(Oracle, SQL Server, or Teradata).
For operating system problems, refer to Performance Tuning UNIX Systems or
Performance Tuning Windows NT/2000 Systems.
Problems inside PowerCenter refer to anything that PowerCenter controls, such as
the actual transformation logic and the workflow/session settings. The session
settings contain quite a few memory settings and partitioning options that can greatly
increase performance. Refer to Tuning Sessions for Better Performance for more
information.
There are certain procedures to consider when optimizing mappings; however, be careful,
because in most cases the mapping design is dictated by business logic. This means
that while there may be a more efficient way to perform the business logic within the
mapping, the necessary business logic itself cannot be ignored simply to increase
performance. Refer to Tuning Mappings for Better Performance for more information.
4. Re-Execute the Problem Workflow or Session
Re-execute the problem workflow or session, then benchmark the load performance
against the baseline. This step is iterative and should be performed after any
performance-based setting is changed. You are trying to answer the question: did your
performance change make a positive impact? If so, move on to the next bottleneck.
Be sure to document every step along the way so you have a clear record of what has
and hasn't been tried.

After the recommended steps have been taken for each relevant performance
bottleneck, re-run the problem workflow or session and compare the results to the
benchmark. Hopefully, you have met your initial performance goal and made a
significant performance impact. While it may seem like there are an enormous number
of areas where a performance problem can arise, if you follow the steps for finding your
bottleneck and apply the tuning techniques specific to it, you should achieve the
desired performance gain.


Tuning Mappings for Better Performance
Challenge
In general, mapping-level optimization takes time to implement, but can significantly
boost performance. Sometimes the mapping is the biggest bottleneck in the load
process because business rules determine the number and complexity of
transformations in a mapping.
Before deciding on the best route to optimize the mapping architecture, you need to
resolve some basic issues. Tuning mappings falls into two groups of techniques. The
first group helps almost universally, bringing about a performance increase in nearly
all scenarios. The second group may yield only a small performance increase or may be
of significant value, depending on the situation.
Some factors to consider when choosing tuning processes at the mapping level include
the specific environment, software/ hardware limitations, and the number of rows going
through a mapping. This Best Practice offers some guidelines for tuning mappings.
Description
Analyze mappings for tuning only after you have tuned the target and source for peak
performance. To optimize mappings, you generally reduce the number of
transformations in the mapping and delete unnecessary links between transformations.
For transformations that use data cache (such as Aggregator, Joiner, Rank, and Lookup
transformations), limit connected input/output or output ports. Doing so can reduce the
amount of data the transformations store in the data cache. Having too many Lookups
and Aggregators can encumber performance because each requires index cache and
data cache. Since both are fighting for memory space, decreasing the number of these
transformations in a mapping can help improve speed. Splitting them up into different
mappings is another option.
Limit the number of Aggregators in a mapping. A high number of Aggregators can
increase I/O activity on the cache directory. Unless the seek/access time is fast on the
directory itself, having too many Aggregators can cause a bottleneck. Similarly, too
many Lookups in a mapping causes contention of disk and memory, which can lead to
thrashing, leaving insufficient memory to run a mapping efficiently.
Consider Single-Pass Reading

If several mappings use the same data source, consider a single-pass reading.
Consolidate separate mappings into one mapping with either a single Source Qualifier
Transformation or one set of Source Qualifier Transformations as the data source for
the separate data flows.
Similarly, if a function is used in several mappings, a single-pass reading reduces the
number of times that function is called in the session.
Optimize SQL Overrides
When SQL overrides are required in a Source Qualifier, Lookup Transformation, or in
the update override of a target object, be sure the SQL statement is tuned. The extent
to which and how SQL can be tuned depends on the underlying source or target
database system. See the section Tuning SQL Overrides and Environment for Better
Performance for more information.
Scrutinize Datatype Conversions
PowerCenter Server automatically makes conversions between compatible datatypes.
When these conversions are performed unnecessarily, performance slows. For example,
if a mapping moves data from an integer port to a decimal port, then back to an integer
port, the conversion may be unnecessary.
In some instances however, datatype conversions can help improve performance. This
is especially true when integer values are used in place of other datatypes for
performing comparisons using Lookup and Filter transformations.
Eliminate Transformation Errors
Large numbers of evaluation errors significantly slow performance of the PowerCenter
Server. During transformation errors, the PowerCenter Server engine pauses to
determine the cause of the error, removes the row causing the error from the data
flow, and logs the error in the session log.
Transformation errors can be caused by many things including: conversion errors,
conflicting mapping logic, any condition that is specifically set up as an error, and so
on. The session log can help point out the cause of these errors. If errors recur
consistently for certain transformations, re-evaluate the constraints for these
transformations. Any source of errors should be traced and eliminated.
Optimize Lookup Transformations
There are a number of ways to optimize lookup transformations that are setup in a
mapping.
When to cache lookups
When caching is enabled, the PowerCenter Server caches the lookup table and queries
the lookup cache during the session. When this option is not enabled, the PowerCenter
Server queries the lookup table on a row-by-row basis. NOTE: All the tuning options
mentioned in this Best Practice assume that memory and cache sizing for lookups are

sufficient to ensure that caches will not page to disks. Information regarding memory
and cache sizing for Lookup transformations are covered in Best Practice: Tuning
Sessions for Better Performance.
A better rule of thumb than memory size is to weigh the size of the potential lookup
cache against the number of rows expected to be processed. Consider the following
example.
In Mapping X, the source and lookup tables contain the following numbers of records:
ITEMS (source): 5,000 records
MANUFACTURER: 200 records
DIM_ITEMS: 100,000 records
Number of Disk Reads

                           Cached Lookup    Un-cached Lookup
LKP_Manufacturer
   Build Cache                     200                    0
   Read Source Records           5,000                5,000
   Execute Lookup                    0                5,000
   Total # of Disk Reads         5,200               10,000
LKP_DIM_ITEMS
   Build Cache                 100,000                    0
   Read Source Records           5,000                5,000
   Execute Lookup                    0                5,000
   Total # of Disk Reads       105,000               10,000

Consider the case where MANUFACTURER is the lookup table. If the lookup table is
cached, it takes a total of 5,200 disk reads to build the cache and execute the lookup.
If the lookup table is not cached, it takes a total of 10,000 disk reads to execute the
lookup. In this case, the number of records in the lookup table is small in comparison
with the number of times the lookup is executed, so this lookup should be cached. This
is the more likely scenario.
Consider the case where DIM_ITEMS is the lookup table. If the lookup table is cached,
it results in 105,000 total disk reads to build the cache and execute the lookup. If the
lookup table is not cached, the disk reads total only 10,000. In this case the number of
records in the lookup table is not small in comparison with the number of times the
lookup is executed, so the lookup should not be cached.
Use the following eight-step method to determine whether a lookup should be cached:
1. Code the lookup into the mapping.

2. Select a standard set of data from the source. For example, add a where clause
on a relational source to load a sample 10,000 rows.
3. Run the mapping with caching turned off and save the log.
4. Run the mapping with caching turned on and save the log to a different name
than the log created in step 3.
5. Look in the cached lookup log and determine how long it takes to cache the
lookup object. Note this time in seconds: LOOKUP TIME IN SECONDS = LS.
6. In the non-cached log, take the time from the last lookup cache to the end of
the load in seconds and divide it into the number of rows being processed: NON-
CACHED ROWS PER SECOND = NRS.
7. In the cached log, take the time from the last lookup cache to the end of the
load in seconds and divide it into the number of rows being processed: CACHED
ROWS PER SECOND = CRS.
8. Use the following formula to find the breakeven row point:

(LS*NRS*CRS)/(CRS-NRS) = X

Where X is the breakeven point. If the number of expected source records is less
than X, it is better not to cache the lookup; if it is more than X, it is better to
cache the lookup.

For example:

Assume the lookup takes 166 seconds to cache (LS=166).
Assume with a cached lookup the load is 232 rows per second (CRS=232).
Assume with a non-cached lookup the load is 147 rows per second (NRS = 147).

The formula would result in: (166*147*232)/(232-147) = 66,603.

Thus, if the source has less than 66,603 records, the lookup should not be
cached. If it has more than 66,603 records, then the lookup should be cached.
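A quick way to compute the breakeven point from the two session logs is a one-line awk calculation such as the following (the LS, NRS, and CRS values are the ones from the example above):

awk -v LS=166 -v NRS=147 -v CRS=232 'BEGIN { printf "breakeven source rows = %d\n", (LS * NRS * CRS) / (CRS - NRS) }'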
Sharing lookup caches
There are a number of methods for sharing lookup caches:
Within a specific session run for a mapping, if the same lookup is used
multiple times in a mapping, the PowerCenter Server will re-use the cache for
the multiple instances of the lookup. Using the same lookup multiple times in
the mapping will be more resource intensive with each successive instance. If
multiple cached lookups are from the same table but are expected to return
different columns of data, it may be better to setup the multiple lookups to bring
back the same columns even though not all return ports are used in all lookups.
Bringing back a common set of columns may reduce the number of disk reads.
Across sessions of the same mapping, the use of an unnamed persistent cache
allows multiple runs to use an existing cache file stored on the PowerCenter
Server. If the option of creating a persistent cache is set in the lookup
properties, the memory cache created for the lookup during the initial run is
saved to the PowerCenter Server. This can improve performance because the
Server builds the memory cache from cache files instead of the database. This
feature should only be used when the lookup table is not expected to change
between session runs.

Across different mappings and sessions, the use of a named persistent cache
allows sharing of an existing cache file.
Reducing the number of cached rows
There is an option to use a SQL override in the creation of a lookup cache. Options can
be added to the WHERE clause to reduce the set of records included in the resulting
cache.
NOTE: If you use a SQL override in a lookup, the lookup must be cached.
Optimizing the lookup condition
In the case where a lookup uses more than one lookup condition, set the conditions
with an equal sign first in order to optimize lookup performance.
Indexing the lookup table
The PowerCenter Server must query, sort, and compare values in the lookup condition
columns. As a result, indexes on the database table should include every column used
in a lookup condition. This can improve performance for both cached and un-cached
lookups.
In the case of a cached lookup, an ORDER BY condition is issued in the SQL
statement used to create the cache. Columns used in the ORDER BY condition
should be indexed. The session log will contain the ORDER BY statement.
In the case of an un-cached lookup, since a SQL statement is created for each row
passing into the lookup transformation, performance can be helped by indexing
columns in the lookup condition.
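For example, if a cached lookup on a DIM_ITEMS table matches on ITEM_ID and ITEM_TYPE, an index covering the lookup condition columns could be created as follows (the index name and columns are illustrative only):

CREATE INDEX idx_dim_items_lkp ON DIM_ITEMS (ITEM_ID, ITEM_TYPE);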
Optimize Filter and Router Transformations
Filtering data as early as possible in the data flow improves the efficiency of a
mapping. Instead of using a Filter Transformation to remove a sizeable number of rows
in the middle or end of a mapping, use a filter on the Source Qualifier or a Filter
Transformation immediately after the source qualifier to improve performance.
Avoid complex expressions when creating the filter condition. Filter
transformations are most effective when a simple integer or TRUE/FALSE expression is
used in the filter condition.
Filters or routers should also be used to drop rejected rows from an Update
Strategy transformation if rejected rows do not need to be saved.
Replace multiple filter transformations with a router transformation. This
reduces the number of transformations in the mapping and makes the mapping easier
to follow.
Optimize aggregator transformations

Aggregator Transformations often slow performance because they must group data
before processing it.
Use simple columns in the group by condition to make the Aggregator
Transformation more efficient. When possible, use numbers instead of strings or dates
in the GROUP BY columns. Also avoid complex expressions in the Aggregator
expressions, especially in GROUP BY ports.
Use the Sorted Input option in the Aggregator. This option requires that data sent to
the Aggregator be sorted in the order in which the ports are used in the Aggregator's
group by. The Sorted Input option decreases the use of aggregate caches. When it is
used, the PowerCenter Server assumes all data is sorted by group and, as a group is
passed through an Aggregator, calculations can be performed and information passed
on to the next transformation. Without sorted input, the Server must wait for all rows
of data before processing aggregate calculations. Use of the Sorted Input option is
usually accompanied by a Source Qualifier that uses the Number of Sorted Ports
option.
Use an Expression and Update Strategy instead of an Aggregator Transformation.
This technique can only be used if the source data can be sorted. Further, using this
option assumes that a mapping is using an Aggregator with Sorted Input option. In the
Expression Transformation, the use of variable ports is required to hold data from the
previous row of data processed. The premise is to use the previous row of data to
determine whether the current row is a part of the current group or is the beginning of
a new group. Thus, if the row is a part of the current group, then its data would be
used to continue calculating the current group function. An Update Strategy
Transformation would follow the Expression Transformation and set the first row of a
new group to insert and the following rows to update.
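A minimal sketch of the variable-port logic in the Expression transformation, using hypothetical port names (DEPT_ID as the group key and SALES as the value being aggregated) and relying on the fact that PowerCenter evaluates variable ports in order, top to bottom:

v_NEW_GROUP (variable)    = IIF(DEPT_ID != v_PREV_DEPT, 1, 0)
v_RUNNING_SUM (variable)  = IIF(v_NEW_GROUP = 1, SALES, v_RUNNING_SUM + SALES)
v_PREV_DEPT (variable)    = DEPT_ID
o_NEW_GROUP (output)      = v_NEW_GROUP
o_RUNNING_SUM (output)    = v_RUNNING_SUM

The Update Strategy transformation that follows would then use an expression such as:
IIF(o_NEW_GROUP = 1, DD_INSERT, DD_UPDATE)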
Joiner Transformation
Joining data from the same source
You can join data from the same source in the following ways:
Join two branches of the same pipeline.
Create two instances of the same source and join pipelines from these source
instances.
You may want to join data from the same source if you want to perform a calculation
on part of the data and join the transformed data with the original data. When you join
the data using this method, you can maintain the original data and transform parts of
that data within one mapping.
When you join data from the same source, you can create two branches of the pipeline.
When you branch a pipeline, you must add a transformation between the Source
Qualifier and the Joiner transformation in at least one branch of the pipeline. You must
join sorted data and configure the Joiner transformation for sorted input.
If you want to join unsorted data, you must create two instances of the same source
and join the pipelines.

For example, you have a source with the following ports:
Employee
Department
Total Sales
In the target table, you want to view the employees who generated sales that were
greater than the average sales for their respective departments. To accomplish this,
you create a mapping with the following transformations:


Sorter transformation. Sort the data.
Sorted Aggregator transformation. Average the sales data and group by
department. When you perform this aggregation, you lose the data for individual
employees. To maintain employee data, you must pass a branch of the pipeline
to the Aggregator transformation and pass a branch with the same data to the
Joiner transformation to maintain the original data. When you join both
branches of the pipeline, you join the aggregated data with the original data.
Sorted Joiner transformation. Use a sorted Joiner transformation to join the
sorted aggregated data with the original data.
Filter transformation. Compare each employee's sales against the department
average and filter out employees whose sales are not above average.
(Figure: joining two branches of the same pipeline.)
Note: You can also join data from output groups of the same transformation, such as
the Custom transformation or XML Source Qualifier transformation. Place a Sorter
transformation between each output group and the Joiner transformation and configure
the Joiner transformation to receive sorted input.
Joining two branches can affect performance if the Joiner transformation receives data
from one branch much later than the other branch. The Joiner transformation caches all
the data from the first branch, and writes the cache to disk if the cache fills. The Joiner
transformation must then read the data from disk when it receives the data from the
second branch. This can slow processing.
You can also join same source data by creating a second instance of the source. After
you create the second source instance, you can join the pipelines from the two source
instances.
(Figure: two instances of the same source joined using a Joiner transformation.)
Note: When you join data using this method, the PowerCenter Server reads the source
data for each source instance, so performance can be slower than joining two branches
of a pipeline.
Use the following guidelines when deciding whether to join branches of a pipeline or
join two instances of a source:
Join two branches of a pipeline when you have a large source or if you can read
the source data only once. For example, you can only read source data from a
message queue once.
Join two branches of a pipeline when you use sorted data. If the source data is
unsorted and you use a Sorter transformation to sort the data, branch the
pipeline after you sort the data.
Join two instances of a source when you need to add a blocking transformation to
the pipeline between the source and the Joiner transformation.
Join two instances of a source if one pipeline may process much more slowly than
the other pipeline.
Performance Tips
Use the database to do the join when sourcing data from the same database
schema. Database systems usually can perform the join more quickly than the
PowerCenter Server, so a SQL override or a join condition should be used when joining
multiple tables from the same database schema.
Use Normal joins whenever possible. Normal joins are faster than outer joins and
the resulting set of data is also smaller.
Join sorted data when possible. You can improve session performance by
configuring the Joiner transformation to use sorted input. When you configure the
Joiner transformation to use sorted data, the PowerCenter Server improves
performance by minimizing disk input and output. You see the greatest performance
improvement when you work with large data sets.

For an unsorted Joiner transformation, designate as the master source the
source with fewer rows. For optimal performance and disk storage, the master source
should be the smaller of the two sources. During a session, the Joiner
transformation compares each row of the master source against the detail source. The
fewer unique rows in the master, the fewer iterations of the join comparison occur,
which speeds the join process.
For a sorted Joiner transformation, designate as the master source the source
with fewer duplicate key values. For optimal performance and disk storage, the
master source should be the one with fewer duplicate key values. When the
PowerCenter Server processes a sorted Joiner transformation, it caches rows for one
hundred keys at a time. If the master source contains many rows with the same key
value, the PowerCenter Server must cache more rows, and performance can be slowed.
Optimizing Sorted Joiner Transformations with Partitions
When you use partitions with a sorted Joiner transformation, you may optimize
performance by grouping data and using n:n partitions.
Add a hash auto-keys partition upstream of the sort origin
To obtain expected results and get best performance when partitioning a sorted Joiner
transformation, you must group and sort data. To group data, ensure that rows with
the same key value are routed to the same partition. The best way to ensure that data
is grouped and distributed evenly among partitions is to add a hash auto-keys or key-
range partition point before the sort origin. Placing the partition point before you sort
the data ensures that you maintain grouping and sort the data within each group.
Use n:n partitions
You may be able to improve performance for a sorted Joiner transformation by using
n:n partitions. When you use n:n partitions, the Joiner transformation reads master and
detail rows concurrently and does not need to cache all of the master data. This
reduces memory usage and speeds processing. When you use 1:n partitions, the Joiner
transformation caches all the data from the master pipeline and writes the cache to disk
if the memory cache fills. When the Joiner transformation receives the data from the
detail pipeline, it must then read the data from disk to compare the master and detail
pipelines.
Optimize Sequence Generator Transformations
Sequence Generator transformations need to determine the next available sequence
number; increasing the Number of Cached Values property can therefore increase
performance. This property determines the number of values the PowerCenter Server
caches at one time. If it is set to cache no values, the PowerCenter Server must
query the repository each time to determine the next number to be used. Note that
any cached values not used in the course of a session are lost, since the sequence
generator value stored in the repository is advanced to the start of the next block of
cached values each time a new set is requested.
Avoid External Procedure Transformations

For the most part, making calls to external procedures slows a session. If possible,
avoid the use of these Transformations, which include Stored Procedures, External
Procedures, and Advanced External Procedures.
Field-Level Transformation Optimization
As a final step in the tuning process, you can tune expressions used in transformations.
When examining expressions, focus on complex expressions and try to simplify them
when possible.
To help isolate slow expressions, do the following:
1. Time the session with the original expression.
2. Copy the mapping and replace half the complex expressions with a constant.
3. Run and time the edited session.
4. Make another copy of the mapping and replace the other half of the complex
expressions with a constant.
5. Run and time the edited session.
Processing field level transformations takes time. If the transformation expressions are
complex, then processing is even slower. It's often possible to get a 10 to 20 percent
performance improvement by optimizing complex field-level transformations. Use the
target table mapping reports or the Metadata Reporter to examine the transformations.
Likely candidates for optimization are the fields with the most complex expressions.
Keep in mind that there may be more than one field causing performance problems.
Factoring out common logic
This can reduce the number of times a mapping performs the same logic. If the same
logic is performed multiple times in a mapping, moving the task upstream may allow
the logic to be done just once. For example, a mapping has five
target tables. Each target requires a Social Security Number lookup. Instead of
performing the lookup right before each target, move the lookup to a position before
the data flow splits.
Minimize function calls
Anytime a function is called it takes resources to process. There are several common
examples where function calls can be reduced or eliminated.
Aggregate function calls can sometime be reduced. In the case of each aggregate
function call, the PowerCenter Server must search and group the data.
Thus the following expression:
SUM(Column A) + SUM(Column B)
Can be optimized to:
SUM(Column A + Column B)

In general, operators are faster than functions, so operators should be used
whenever possible.
For example, if you have an expression that involves a CONCAT function such as:
CONCAT(CONCAT(FIRST_NAME, ' '), LAST_NAME)
It can be optimized to:
FIRST_NAME || ' ' || LAST_NAME
Remember that IIF() is a function that returns a value, not just a logical test.
This allows many logical statements to be written in a more compact fashion.
For example:
IIF(FLG_A='Y' and FLG_B='Y' and FLG_C='Y', VAL_A+VAL_B+VAL_C,
IIF(FLG_A='Y' and FLG_B='Y' and FLG_C='N', VAL_A+VAL_B,
IIF(FLG_A='Y' and FLG_B='N' and FLG_C='Y', VAL_A+VAL_C,
IIF(FLG_A='Y' and FLG_B='N' and FLG_C='N', VAL_A,
IIF(FLG_A='N' and FLG_B='Y' and FLG_C='Y', VAL_B+VAL_C,
IIF(FLG_A='N' and FLG_B='Y' and FLG_C='N', VAL_B,
IIF(FLG_A='N' and FLG_B='N' and FLG_C='Y', VAL_C,
IIF(FLG_A='N' and FLG_B='N' and FLG_C='N', 0.0))))))))
Can be optimized to:
IIF(FLG_A='Y', VAL_A, 0.0) + IIF(FLG_B='Y', VAL_B, 0.0) + IIF(FLG_C='Y', VAL_C, 0.0)
The original expression had 8 IIFs, 16 ANDs and 24 comparisons. The optimized
expression results in 3 IIFs, 3 comparisons and two additions.
Be creative in making expressions more efficient. The following example reworks an
expression to reduce three comparisons to one (the rewrite is equivalent only for the
range of values X can actually take):
For example:
IIF(X=1 OR X=5 OR X=9, 'yes', 'no')
Can be optimized to:
IIF(MOD(X, 4) = 1, 'yes', 'no')

Calculate once, use many times
Avoid calculating or testing the same value multiple times. If the same sub-expression
is used several times in a transformation, consider making the sub-expression a local
variable. The local variable can be used only within the transformation in which it was
created. By calculating the variable only once and then referencing the variable in
following sub-expressions, performance will be increased.
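As an illustrative sketch (the port names are hypothetical), a tax amount needed by several output ports can be computed once in a variable port and then referenced:

v_TAX_AMOUNT (variable)  = ROUND(PRICE * TAX_RATE, 2)
o_TAX_AMOUNT (output)    = v_TAX_AMOUNT
o_TOTAL_PRICE (output)   = PRICE + v_TAX_AMOUNT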
Choose numeric versus string operations
The PowerCenter Server processes numeric operations faster than string operations.
For example, if a lookup is done on a large amount of data on two columns,
EMPLOYEE_NAME and EMPLOYEE_ID, configuring the lookup around EMPLOYEE_ID
improves performance.
Optimizing char-char and char-varchar comparisons
When the PowerCenter Server performs comparisons between CHAR and VARCHAR
columns, it slows each time it finds trailing blank spaces in the row. To resolve this,
enable the Treat CHAR as CHAR On Read option in the PowerCenter Server setup so that
the server does not trim trailing spaces from the end of CHAR source fields.
Use DECODE instead of LOOKUP
When a LOOKUP function is used, the PowerCenter Server must lookup a table in the
database. When a DECODE function is used, the lookup values are incorporated into the
expression itself so the server does not need to lookup a separate table. Thus, when
looking up a small set of unchanging values, using DECODE may improve performance.
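For instance, a small, static set of status codes could be resolved with a DECODE expression instead of a lookup (the codes and descriptions here are invented for illustration):

DECODE(STATUS_CODE,
       'A', 'Active',
       'I', 'Inactive',
       'P', 'Pending',
       'Unknown')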
Reduce the number of transformations in a mapping
Because there is always overhead involved in moving data between transformations,
try, whenever possible, to reduce the number of transformations. Also, resolve
unnecessary links between transformations to minimize the amount of data moved.
This is especially important with data being pulled from the Source Qualifier
Transformation.
Use pre- and post-session SQL commands
You can specify pre- and post-session SQL commands in the Properties tab of the
Source Qualifier transformation and in the Properties tab of the target instance in a
mapping. To increase the load speed, use these commands to drop indexes on the
target before the session runs, then recreate them when the session completes.
Apply the following guidelines when using SQL statements:
You can use any command that is valid for the database type. However, the
PowerCenter Server does not allow nested comments, even though the database
may.

You can use mapping parameters and variables in SQL executed against the
source, but not against the target.
Use a semi-colon (;) to separate multiple statements.
The PowerCenter Server ignores semi-colons within single quotes, double quotes,
or within /* ...*/.
If you need to use a semi-colon outside of quotes or comments, you can escape it
with a back slash (\).
The Workflow Manager does not validate the SQL.
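As a sketch of how this might look for a relational target (the table and index names are placeholders, and the exact DDL depends on your database):

Pre-session SQL (target instance):   DROP INDEX idx_tgt_orders_custid;
Post-session SQL (target instance):  CREATE INDEX idx_tgt_orders_custid ON TGT_ORDERS (CUST_ID);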
Use environmental SQL
For relational databases, you can execute SQL commands in the database environment
when connecting to the database. You can use this for source, target, lookup, and
stored procedure connections. For instance, you can set isolation levels on the source
and target systems to avoid deadlocks. Follow the guidelines listed above for using the
SQL statements.
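For example, on a SQL Server source connection you might lower the isolation level for the extract (only if dirty reads are acceptable for the data in question):

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED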


Tuning Sessions for Better Performance
Challenge
Running sessions is where the pedal hits the metal. A common misconception is that
this is the area where most tuning should occur. While it is true that various specific
session options can be modified to improve performance, this should not be the major
or only area of focus when implementing performance tuning.
Description
The greatest area for improvement at the session level usually involves tweaking
memory cache settings. The Aggregator (without sorted ports), Joiner, Rank, Sorter
and Lookup Transformations (with caching enabled) use caches. Review the memory
cache settings for sessions where the mappings contain any of these transformations.
The PowerCenter Server uses the index and data caches for each of these
transformations. If the allocated data or index cache is not large enough to store the
data, the PowerCenter Server stores the data in a temporary disk file as it processes
the session data. Each time the PowerCenter Server pages to the temporary file,
performance slows.
You can see when the PowerCenter Server pages to the temporary file by examining
the performance details. The Transformation_readfromdisk or
Transformation_writetodisk counters for any Aggregator, Rank, Lookup, Sorter, or
Joiner transformation indicate the number of times the PowerCenter Server must page
to disk to process the transformation. Since the data cache is typically larger than the
index cache, you should increase the data cache more than the index cache.
The PowerCenter Server creates the index and data cache files by default in the
PowerCenter Server variable directory, $PMCacheDir. The naming convention used by
the PowerCenter Server for these files is PM [type of transformation] [generated
session instance id number] _ [transformation instance id number] _ [partition
index].dat or .idx. For example, an aggregate data cache file would be named
PMAGG31_19.dat. The cache directory may be changed however, if disk space is a
constraint. Informatica recommends that the cache directory be local to the
PowerCenter Server. You may encounter performance or reliability problems when you
cache large quantities of data on a mapped or mounted drive.

If the PowerCenter Server requires more memory than the configured cache size, it
stores the overflow values in these cache files. Since paging to disk can slow session
performance, try to configure the index and data cache sizes to store the appropriate
amount of data in memory. Refer to Session Caches in the Workflow Administration
Guide for detailed information on determining cache sizes.
The PowerCenter Server writes to the index and data cache files during a session in the
following cases:
The mapping contains one or more Aggregator transformations, and the session is
configured for incremental aggregation.
The mapping contains a Lookup transformation that is configured to use a
persistent lookup cache, and the PowerCenter Server runs the session for the
first time.
The mapping contains a Lookup transformation that is configured to initialize the
persistent lookup cache.
The Data Transformation Manager (DTM) process in a session runs out of cache
memory and pages to the local cache files. The DTM may create multiple files
when processing large amounts of data. The session fails if the local directory
runs out of disk space.
When a session is running, the PowerCenter Server writes a message in the session log
indicating the cache file name and the transformation name. When a session completes,
the DTM generally deletes the overflow index and data cache files. However, index and
data files may exist in the cache directory if the session is configured for either
incremental aggregation or to use a persistent lookup cache. Cache files may also
remain if the session does not complete successfully.
If a cache file handles more than two gigabytes of data, the PowerCenter Server
creates multiple index and data files. When creating these files, the PowerCenter Server
appends a number to the end of the filename, such as PMAGG*.idx1 and PMAGG*.idx2.
The number of index and data files is limited only by the amount of disk space available
in the cache directory.
Aggregator Caches
Keep the following items in mind when configuring the aggregate memory cache sizes:
Allocate at least enough space to hold at least one row in each aggregate group.
Remember that you only need to configure cache memory for an Aggregator
transformation that does not use sorted ports. The PowerCenter Server uses
memory to process an Aggregator transformation with sorted ports, not cache
memory.
Incremental aggregation can improve session performance. When it is used, the
PowerCenter Server saves index and data cache information to disk at the end
of the session. The next time the session runs, the PowerCenter Server uses this
historical information to perform the incremental aggregation. The PowerCenter
Server names these files PMAGG*.dat and PMAGG*.idx and saves them to the
cache directory. Mappings that have sessions which use incremental aggregation

should be set up so that only new detail records are read with each subsequent
run.
When configuring Aggregate data cache size, remember that the data cache holds
row data for variable ports and connected output ports only. As a result, the
data cache is generally larger than the index cache. To reduce the data cache
size, connect only the necessary output ports to subsequent transformations.

Joiner Caches
When a session is run with a Joiner transformation, the PowerCenter Server reads from
master and detail sources concurrently and builds index and data caches based on the
master rows. The PowerCenter Server then performs the join based on the detail source
data and the cache data.
The number of rows the PowerCenter Server stores in the cache depends on the
partitioning scheme, the data in the master source, and whether or not you use sorted
input.
After the memory caches are built, the PowerCenter Server reads the rows from the
detail source and performs the joins. The PowerCenter Server uses the index cache to
test the join condition. When it finds source data and cache data that match, it
retrieves row values from the data cache.
Lookup Caches
Several options can be explored when dealing with Lookup transformation caches.
Persistent caches should be used when lookup data is not expected to change
often. Lookup cache files are saved after a session which has a lookup that uses
a persistent cache is run for the first time. These files are reused for subsequent
runs, bypassing the querying of the database for the lookup. If the lookup table
changes, you must be sure to set the Recache from Database option to
ensure that the lookup cache files are rebuilt.
Lookup caching should be enabled for relatively small tables. Refer to Best
Practice: Tuning Mappings for Better Performance to determine when lookups
should be cached. When the Lookup transformation is not configured for
caching, the PowerCenter Server queries the lookup table for each input row.
The result of the lookup query and processing is the same, regardless of
whether the lookup table is cached or not. However, when the transformation is
configured to not cache, the PowerCenter Server queries the lookup table
instead of the lookup cache. Using a lookup cache can sometimes increase
session performance.
Just like for a joiner, the PowerCenter Server aligns all data for lookup caches on
an eight-byte boundary, which helps increase the performance of the lookup.

Allocating buffer memory
When the PowerCenter Server initializes a session, it allocates blocks of memory to hold
source and target data. Sessions that use a large number of sources and targets may

require additional memory blocks. By default, a session has enough buffer blocks for
83 sources and targets. If you run a session that has more than 83 sources and
targets, you can increase the number of available memory blocks by adjusting the
following session parameters:
DTM buffer size - the default setting is 12,000,000 bytes.
Default buffer block size - the default size is 64,000 bytes.
To configure these settings, first determine the number of memory blocks the
PowerCenter Server requires to initialize the session. Then you can calculate the buffer
size and/or the buffer block size based on the default settings, to create the required
number of session blocks.
If there are XML sources or targets in the mappings, use the number of groups in the
XML source or target in the total calculation for the total number of sources and
targets.
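As a rough worked example (the exact formula can vary by PowerCenter version, so treat this as an approximation rather than the product's documented algorithm): each source or target generally needs about two buffer blocks, and the number of blocks a session can create is roughly 0.9 * (DTM buffer size) / (buffer block size). With the defaults, 0.9 * 12,000,000 / 64,000 is approximately 168 blocks, which corresponds roughly to the 83 sources and targets mentioned above. A hypothetical mapping with 60 sources and 60 targets would therefore need about (60 + 60) * 2 = 240 blocks, implying a DTM buffer size of at least 240 * 64,000 / 0.9, or roughly 17,000,000 bytes, at the default block size.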
Increasing the DTM Buffer Pool Size
The DTM Buffer Pool Size setting specifies the amount of memory the PowerCenter
Server uses as DTM buffer memory. The PowerCenter Server uses DTM buffer memory
to create the internal data structures and buffer blocks used to bring data into and out
of the server. When the DTM buffer memory is increased, the PowerCenter Server
creates more buffer blocks, which can improve performance during momentary
slowdowns.
If a session's performance details show low numbers for your source and target
BufferInput_efficiency and BufferOutput_efficiency counters, increasing the DTM buffer
pool size may improve performance.
Increasing DTM buffer memory allocation generally causes performance to improve
initially and then level off. When the DTM buffer memory allocation is increased, you
need to evaluate the total memory available on the PowerCenter Server. If a session is
part of a concurrent batch, the combined DTM buffer memory allocated for the sessions
or batches must not exceed the total memory for the PowerCenter Server system. You
can increase the DTM buffer size in the Performance settings of the Properties tab.
If you don't see a significant performance increase after increasing DTM buffer memory,
then it was not a factor in session performance.
Optimizing the Buffer Block Size
Within a session, you can modify the buffer block size by changing it in the advanced
section of the Config tab. This specifies the size of a memory block that is used to move
data throughout the pipeline. Each source, each transformation, and each target may
have a different row size, which results in different numbers of rows that can be fit into
one memory block.
Row size is determined in the server, based on number of ports, their data types, and
precisions. Ideally, buffer block size should be configured so that it can hold roughly 20
rows at a time. When calculating this, use the source or target with the largest row
size. The default is 64K. The buffer block size does not become a factor in session
performance until the number of rows falls below 10. Informatica recommends that the
size of the shared memory (which determines the number of buffers available to the
session) should not be increased at all unless the mapping is complex (i.e., more than
20 transformations).
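As an illustration with hypothetical row sizes: if the widest source or target row is about 2,500 bytes, the default 64,000-byte block already holds roughly 25 rows and needs no change. If the widest row is about 5,000 bytes, the default block holds only about 12 rows, so raising the buffer block size to approximately 5,000 * 20 = 100,000 bytes restores the recommended 20 rows per block.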
Running concurrent sessions and workflows
The PowerCenter Server can process multiple sessions in parallel and can also process
multiple partitions of a pipeline within a session. If you have a symmetric multi-
processing (SMP) platform, you can use multiple CPUs to concurrently process session
data or partitions of data. This provides improved performance since true parallelism is
achieved. On a single processor platform, these tasks share the CPU, so there is no
parallelism.
To achieve better performance, you can create a workflow that runs several sessions in
parallel on one PowerCenter Server. This technique should only be employed on servers
with multiple CPUs available. Concurrent sessions use a maximum of approximately 1.4
CPUs for the first session and 1 CPU for each additional session. Also, it has
been noted that simple mappings (i.e., mappings with only a few transformations) do
not make the engine CPU-bound, and therefore use a lot less processing power than a
full CPU.
If there are independent sessions that use separate sources and mappings to populate
different targets, they can be placed in a single workflow and linked concurrently to run
at the same time. Alternatively, these sessions can be placed in different workflows that
are run concurrently.
If there is a complex mapping with multiple sources, you can separate it into several
simpler mappings with separate sources. This enables you to place concurrent sessions
for these mappings in a workflow to be run in parallel.
Partitioning sessions
Performance can be improved by processing data in parallel in a single session by
creating multiple partitions of the pipeline. If you use PowerCenter, you can increase
the number of partitions in a pipeline to improve session performance. Increasing the
number of partitions allows the PowerCenter Server to create multiple connections to
sources and process partitions of source data concurrently.
When you create or edit a session, you can change the partitioning information for each
pipeline in a mapping. If the mapping contains multiple pipelines, you can specify
multiple partitions in some pipelines and single partitions in others. Keep the following
attributes in mind when specifying partitioning information for a pipeline:
Location of partition points: The PowerCenter Server sets partition points at
several transformations in a pipeline by default. If you use PowerCenter, you can
define other partition points. Select those transformations where you think
redistributing the rows in a different way is likely to increase the performance
considerably.
Number of partitions: By default, the PowerCenter Server sets the number of
partitions to one. You can generally define up to 64 partitions at any partition
point. When you increase the number of partitions, you increase the number of
processing threads, which can improve session performance. Increasing the
number of partitions or partition points also increases the load on the server. If
the server contains ample CPU bandwidth, processing rows of data in a session
concurrently can increase session performance. However, if you create a large
number of partitions or partition points in a session that processes large
amounts of data, you can overload the system.
Partition types: The partition type determines how the PowerCenter Server
redistributes data across partition points. The Workflow Manager allows you to
specify the following partition types:
1. Round-robin partitioning: PowerCenter distributes rows of data evenly to all
partitions. Each partition processes approximately the same number of rows. In
a pipeline that reads data from file sources of different sizes, you can use round-
robin partitioning to ensure that each partition receives approximately the same
number of rows.
2. Hash Keys: the PowerCenter Server uses a hash function to group rows of data
among partitions. The PowerCenter Server groups the data based on a partition
key. There are two types of hash partitioning:
o Hash auto-keys: The PowerCenter Server uses all grouped or sorted ports
as a compound partition key. You can use hash auto-keys partitioning at
or before Rank, Sorter, and unsorted Aggregator transformations to
ensure that rows are grouped properly before they enter these
transformations.
o Hash User Keys: The PowerCenter Server uses a hash function to group
rows of data among partitions based on a user-defined partition key. You
choose the ports that define the partition key.
3. Key Range: The PowerCenter Server distributes rows of data based on a port or
set of ports that you specify as the partition key. For each port, you define a
range of values. The PowerCenter Server uses the key and ranges to send rows
to the appropriate partition. Choose key range partitioning where the sources or
targets in the pipeline are partitioned by key range.
4. Pass-through partitioning: The PowerCenter Server processes data without
redistributing rows among partitions. Therefore, all rows in a single partition
stay in that partition after crossing a pass-through partition point.
5. Database partitioning: You can optimize session performance by using
the database partitioning partition type instead of the pass-through partition
type for IBM DB2 targets.
If you find that your system is under-utilized after you have tuned the application,
databases, and system for maximum single-partition performance, you can reconfigure
your session to have two or more partitions to make your session utilize more of the
hardware. Use the following tips when you add partitions to a session:
Add one partition at a time. To best monitor performance, add one partition at
a time, and note your session settings before you add each partition.
Set DTM buffer memory. For a session with n partitions, this value should be at
least n times the value for the session with one partition.
Set cached values for Sequence Generator. For a session with n partitions,
there should be no need to use the number of cached values property of the
Sequence Generator transformation. If you must set this value to a value
greater than zero, make sure it is at least n times the original value for the
session with one partition.
Partition the source data evenly. Configure each partition to extract the same
number of rows.
Monitor the system while running the session. If there are CPU cycles
available (twenty percent or more idle time) then performance may improve
for this session by adding a partition.
Monitor the system after adding a partition. If the CPU utilization does not go
up, the wait for I/O time goes up, or the total data transformation rate goes
down, then there is probably a hardware or software bottleneck. If the wait for
I/O time goes up a significant amount, then check the system for hardware
bottlenecks. Otherwise, check the database configuration.
Tune databases and system. Make sure that your databases are tuned properly
for parallel ETL and that your system has no bottlenecks.

Increasing the target commit interval
One method of resolving target database bottlenecks is to increase the commit interval.
Each time the PowerCenter Server commits, performance slows. Therefore, the smaller
the commit interval, the more often the PowerCenter Server writes to the target
database and the slower the overall performance. If you increase the commit interval,
the number of times the PowerCenter Server commits decreases and performance may
improve.
When increasing the commit interval at the session level, you must remember to
increase the size of the database rollback segments to accommodate the larger number
of rows. One of the major reasons that Informatica has set the default commit interval
to 10,000 is to accommodate the default rollback segment / extent size of most
databases. If you increase both the commit interval and the database rollback
segments, you should see an increase in performance. In some cases though, just
increasing the commit interval without making the appropriate database changes may
cause the session to fail part way through (i.e., you may get a database error like
"unable to extend rollback segments" in Oracle).
Disabling high precision
If a session runs with high precision enabled, disabling high precision may improve
session performance.
The Decimal datatype is a numeric datatype with a maximum precision of 28. To use a
high-precision Decimal datatype in a session, you must configure it so that the
PowerCenter Server recognizes this datatype by selecting Enable high precision in the
session property sheet. However, since reading and manipulating a high-precision
datatype (i.e., those with a precision of greater than 28) can slow the PowerCenter
Server, session performance may be improved by disabling decimal arithmetic. When
you disable high precision, the PowerCenter Server converts data to a double.

Reducing error tracking
If a session contains a large number of transformation errors, you may be able to
improve performance by reducing the amount of data the PowerCenter Server writes to
the session log.
To reduce the amount of time spent writing to the session log file, set the tracing level
to Terse. Terse tracing should only be set if the sessions run without problems and
session details are not required. At this tracing level, the PowerCenter Server does not
write error messages or row-level information for reject data. However, if terse is not
an acceptable level of detail, you may want to consider leaving the tracing level at
Normal and focus your efforts on reducing the number of transformation errors. Note
that the tracing level must be set to Normal in order to use the reject loading utility.
As an additional debug option (beyond the PowerCenter Debugger), you may set the
tracing level to verbose initialization or verbose data.
Verbose initialization logs initialization details in addition to normal, names of
index and data files used, and detailed transformation statistics.
Verbose data logs each row that passes into the mapping. It also notes where the
PowerCenter Server truncates string data to fit the precision of a column and
provides detailed transformation statistics. When you configure the tracing level
to verbose data, the PowerCenter Server writes row data for all rows in a block
when it processes a transformation.
However, the verbose initialization and verbose data logging options significantly affect
the session performance. Do not use Verbose tracing options except when testing
sessions. Always remember to switch tracing back to Normal after the testing is
complete.
The session tracing level overrides any transformation-specific tracing levels within the
mapping. Informatica does not recommend reducing error tracing as a long-term
response to high levels of transformation errors. Because there are only a handful of
reasons why transformation errors occur, it makes sense to fix and prevent any
recurring transformation errors. PowerCenter uses the mapping tracing level when the
session tracing level is set to none.


Tuning SQL Overrides and Environment for Better
Performance
Challenge
Tuning SQL Overrides and SQL queries within the source qualifier objects can improve
performance in selecting data from source database tables, which positively impacts the
overall session performance. This Best Practice explores ways to optimize a SQL query
within the source qualifier object. The tips here can be applied to any PowerCenter
mapping. While the SQL discussed here is executed in Oracle 8 and above, the
techniques are generally applicable, but specifics for other RDBMS products (e.g., SQL
Server, Sybase, etc.) are not included.
Description

SQL Queries Performing Data Extractions
Optimizing SQL queries is perhaps the most complex portion of performance tuning.
When tuning SQL, the developer must look at the type of execution being forced by
hints, the execution plan, the indexes on the query tables, the logic of
the SQL statement itself, and the SQL syntax. The following paragraphs discuss each of
these areas in more detail.
DB2 Coalesce and Oracle NVL
When examining data with NULLs, it is often necessary to substitute a value to make
comparisons and joins work. In Oracle, the NVL function is used, while in DB2, the
COALESCE function is used.
Here is an example of the Oracle NVL function:
SELECT DISTINCT bio.experiment_group_id, bio.database_site_code
FROM exp.exp_bio_result bio, sar.sar_data_load_log log
WHERE bio.update_date BETWEEN log.start_time AND log.end_time
AND NVL(bio.species_type_code, 'X') IN ('mice', 'rats', 'X')
AND log.seq_no = (SELECT MAX(seq_no) FROM sar.sar_data_load_log
WHERE load_status = 'P')
Here is the same query in DB2:
SELECT DISTINCT bio.experiment_group_id, bio.database_site_code
FROM bio_result bio, data_load_log log
WHERE bio.update_date BETWEEN log.start_time AND log.end_time
AND COALESCE(bio.species_type_code, 'X') IN ('mice', 'rats', 'X')
AND log.seq_no = (SELECT MAX(seq_no) FROM data_load_log
WHERE load_status = 'P')

Surmounting the Single SQL Statement Limitation in Oracle or DB2: In-
line Views
In source qualifiers and lookup objects, you are limited to a single SQL statement.
There are several ways to get around this limitation.
You can create views in the database and use them as you would tables, either as
source tables, or in the FROM clause of the SELECT statement. This can simplify the
SQL and make it easier to understand, but it also makes it harder to maintain. The logic
is now in two places: in an Informatica mapping and in a database view.
You can use in-line views which are SELECT statements in the FROM or WHERE clause.
This can help focus the query to a subset of data in the table and work more efficiently
than using a traditional join. Here is an example of an in-line view in the FROM clause:
SELECT N.DOSE_REGIMEN_TEXT as DOSE_REGIMEN_TEXT,
N.DOSE_REGIMEN_COMMENT as DOSE_REGIMEN_COMMENT,
N.DOSE_VEHICLE_BATCH_NUMBER as DOSE_VEHICLE_BATCH_NUMBER,
N.DOSE_REGIMEN_ID as DOSE_REGIMEN_ID
FROM DOSE_REGIMEN N,
(SELECT DISTINCT R.DOSE_REGIMEN_ID as DOSE_REGIMEN_ID
FROM EXPERIMENT_PARAMETER R,
NEW_GROUP_TMP TMP
WHERE R.EXPERIMENT_PARAMETERS_ID = TMP.EXPERIMENT_PARAMETERS_ID
AND R.SCREEN_PROTOCOL_ID = TMP.BDS_PROTOCOL_ID
) X
WHERE N.DOSE_REGIMEN_ID = X.DOSE_REGIMEN_ID
ORDER BY N.DOSE_REGIMEN_ID
Surmounting the Single SQL Statement Limitation in DB2: Using the
Common Table Expression temp tables and the WITH Clause
The Common Table Expression (CTE) stores data in temp tables during the execution of
the SQL statement. The WITH clause lets you assign a name to a CTE block. You can
then reference the CTE block in multiple places in the query by specifying the query
name. For example:
WITH maxseq AS (SELECT MAX(seq_no) as seq_no FROM data_load_log WHERE
load_status = 'P')
SELECT DISTINCT bio.experiment_group_id, bio.database_site_code
FROM bio_result bio, data_load_log log, maxseq
WHERE bio.update_date BETWEEN log.start_time AND log.end_time
AND COALESCE(bio.species_type_code, 'X') IN ('mice', 'rats', 'X')
AND log.seq_no = maxseq.seq_no
Here is another example using a WITH clause that uses recursive SQL:
WITH PERSON_TEMP (PERSON_ID, NAME, PARENT_ID, LVL) AS
(SELECT PERSON_ID, NAME, PARENT_ID, 1
FROM PARENT_CHILD
WHERE NAME IN ('FRED', 'SALLY', 'JIM')
UNION ALL
SELECT C.PERSON_ID, C.NAME, C.PARENT_ID, RECURS.LVL + 1
FROM PARENT_CHILD C, PERSON_TEMP RECURS
WHERE C.PARENT_ID = RECURS.PERSON_ID
AND RECURS.LVL < 5)
SELECT * FROM PERSON_TEMP
The PARENT_ID in a row refers to the PERSON_ID of that row's parent (the model allows
only one parent per person, which keeps the illustration simple). The LVL counter limits the
depth of the recursion and prevents it from running indefinitely.
CASE (DB2) vs. DECODE (Oracle)
The CASE syntax is allowed in ORACLE, but you are much more likely to see the
DECODE logic, even for a single case since it was the only legal way to test a condition
in earlier versions.
DECODE is not allowed in DB2.
In Oracle:
SELECT EMPLOYEE, FNAME, LNAME,
DECODE(SIGN(SALARY - 10000), -1, 'NEED RAISE',
DECODE(SIGN(SALARY - 1000000), 1, 'OVERPAID',
'THE REST OF US')) AS SALARY_COMMENT
FROM EMPLOYEE
(DECODE cannot test a range directly, so SIGN is used to reduce each comparison to a
discrete value.)
In DB2:
SELECT EMPLOYEE, FNAME, LNAME,
CASE
WHEN SALARY < 10000 THEN 'NEED RAISE'
WHEN SALARY > 1000000 THEN 'OVERPAID'
ELSE 'THE REST OF US'
END AS SALARY_COMMENT
FROM EMPLOYEE
Debugging Tip: Obtaining a Sample Subset
It is often useful to get a small sample of the data from a long running query that
returns a large set of data. The logic can be commented out or removed after it is put
in general use.
DB2 uses the FETCH FIRST n ROWS ONLY clause to do this as follows:
SELECT EMPLOYEE, FNAME, LNAME
FROM EMPLOYEE
WHERE JOB_TITLE = 'WORKERBEE'
FETCH FIRST 12 ROWS ONLY
Oracle does it this way using the ROWNUM variable:
SELECT EMPLOYEE, FNAME, LNAME
FROM EMPLOYEE
WHERE JOB_TITLE = 'WORKERBEE'
AND ROWNUM <= 12
INTERSECT, INTERSECT ALL, UNION, UNION ALL
Remember that both the UNION and INTERSECT operators return distinct rows, while
UNION ALL and INTERSECT ALL return all rows.
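For example, using the CUSTOMERS and EMPLOYEES tables that appear in later examples, the first query below returns a distinct set (and pays for the duplicate elimination), while the second returns all rows:
SELECT NAME_ID FROM CUSTOMERS
UNION
SELECT NAME_ID FROM EMPLOYEES
versus:
SELECT NAME_ID FROM CUSTOMERS
UNION ALL
SELECT NAME_ID FROM EMPLOYEES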
System Dates in Oracle and DB2
Oracle uses the system variable SYSDATE for the current time and date, and allows you
to display either the time and/or the date however you want with date functions.
Here is an example that returns yesterday's date in Oracle (default format as
mm/dd/yyyy):
SELECT TRUNC(SYSDATE) - 1 FROM DUAL
DB2 uses the system variables, here called special registers: CURRENT DATE, CURRENT
TIME, and CURRENT TIMESTAMP.
Here is an example for DB2:
SELECT FNAME, LNAME, CURRENT DATE AS TODAY
FROM EMPLOYEE
Oracle: Using Hints
Hints affect the way a query or sub-query is executed and can, therefore, provide a
significant performance increase in queries. Hints cause the database engine to
relinquish control over how a query is executed, thereby giving the developer control
over the execution. Hints are always honored unless execution is not possible. Because
the database engine does not evaluate whether the hint makes sense, developers must
be careful in implementing hints. Oracle has many types of hints: optimizer hints,
access method hints, join order hints, join operation hints, and parallel execution hints.
Optimizer and access method hints are the most common.
In the latest versions of Oracle, the Cost-based query analysis is built-in and Rule-
based analysis is no longer possible. It was in Rule-based Oracle systems that hints
mentioning specific indexes were most helpful. In Oracle version 9.2, however, the use
of /*+ INDEX */ hints may actually decrease performance significantly in many
cases. If you are using older versions of Oracle however, the use of the proper INDEX
hints should help performance.
The optimizer hint allows the developer to change the optimizer's goals when creating
the execution plan. The table below provides a partial list of optimizer hints and
descriptions.
Optimizer hints: Choosing the best join method
Sort/merge and hash joins are in the same group, but nested loop joins are very
different. Sort/merge involves two sorts while the nested loop involves no sorts. The
hash join also requires memory to build the hash table.
Hash joins are most effective when the amount of data is large and one table is much
larger than the other.
Here is an example of a select that performs best as a hash join:
SELECT COUNT(*) FROM CUSTOMERS C, MANAGERS M
WHERE C.CUST_ID = M.MANAGER_ID
Considerations Join Type
Better throughput Sort/Merge
Better response time Nested loop
Large subsets of data Sort/Merge
Index available to support join Nested loop
Limited memory and CPU available for sorting Nested loop
Parallel execution Sort/Merge or Hash
Joining all or most of the rows of large tables Sort/Merge or Hash
Joining small sub-sets of data and index available Nested loop
Hint Description
ALL_ROWS The database engine creates an execution plan that optimizes
for throughput. Favors full table scans. Optimizer favors
Sort/Merge
FIRST_ROWS The database engine creates an execution plan that optimizes
for response time. It returns the first row of data as quickly as
possible. Favors index lookups. Optimizer favors Nested-loops
CHOOSE The database engine creates an execution plan that uses cost-
based execution if statistics have been run on the tables. If
statistics have not been run, the engine uses rule-based
execution. If statistics have been run on empty tables, the
engine still uses cost-based execution, but performance is
extremely poor.
RULE The database engine creates an execution plan based on a
fixed set of rules.
USE_NL Use nested loop joins
USE_MERGE Use sort/merge joins
HASH The database engine performs a hash scan of the table. This
hint is ignored if the table is not clustered.
Access method hints
Access method hints control how data is accessed. These hints are used to force the
database engine to use indexes, hash scans, or row id scans. The following table
provides a partial list of access method hints.
Hint Description
ROWID The database engine performs a scan of the table based on
ROWIDS.
INDEX DO NOT USE in Oracle 9.2 and above. The database engine
performs an index scan of a specific table, but in 9.2 and
above, the optimizer does not use any indexes other than those
mentioned.
USE_CONCAT The database engine converts a query with an OR condition
into two or more queries joined by a UNION ALL statement.
The syntax for using a hint in a SQL statement is as follows:
Select /*+ FIRST_ROWS */ empno, ename
From emp;
Select /*+ USE_CONCAT */ empno, ename
From emp;
SQL Execution and Explain Plan
The simplest change is forcing the SQL to choose either rule-based or cost-based
execution. This change can be accomplished without changing the logic of the SQL
query. While cost-based execution is typically considered the best SQL execution, it
relies upon optimization of the Oracle parameters and updated database statistics. If
these statistics are not maintained, cost-based query execution can suffer over time.
When that happens, rule-based execution can actually provide better execution time.

The developer can determine which type of execution is being used by running an
explain plan on the SQL query in question. Note that the step in the explain plan that is
indented the most is the statement that is executed first. The results of that statement
are then used as input by the next level statement.
Typically, the developer should attempt to eliminate any full table scans and index
range scans whenever possible. Full table scans cause degradation in performance.
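For example, in Oracle 9.2 and above an explain plan for a candidate query (the query shown is illustrative) can be generated and displayed as follows; earlier Oracle versions can query PLAN_TABLE directly after the EXPLAIN PLAN statement:
EXPLAIN PLAN FOR
SELECT C.NAME_ID FROM CUSTOMERS C, EMPLOYEES E
WHERE C.NAME_ID = E.NAME_ID;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);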
Information provided by the Explain Plan can be enhanced using the SQL Trace Utility.
This utility provides the following additional information:
The number of executions
The elapsed time of the statement execution
The CPU time used to execute the statement
The SQL Trace Utility adds value because it definitively shows the statements that are
using the most resources, and can immediately show the change in resource
consumption after the statement has been tuned and a new explain plan has been run.
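A minimal way to produce a trace for the current Oracle session is:
ALTER SESSION SET SQL_TRACE = TRUE;
-- run the statement being tuned, then
ALTER SESSION SET SQL_TRACE = FALSE;
The resulting trace file, written to the database's user dump destination, can then be formatted with the tkprof utility to report executions, elapsed time, and CPU time per statement.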
Using Indexes
The explain plan also shows whether indexes are being used to facilitate execution. The
data warehouse team should compare the indexes being used to those available. If
necessary, the administrative staff should identify new indexes that are needed to
improve execution and ask the database administration team to add them to the
appropriate tables. Once implemented, the explain plan should be executed again to
ensure that the indexes are being used. If an index is not being used, it is possible to
force the query to use it by using an access method hint, as described earlier.
Reviewing SQL Logic
The final step in SQL optimization involves reviewing the SQL logic itself. The purpose
of this review is to determine whether the logic is efficiently capturing the data needed
for processing. Review of the logic may uncover the need for additional filters to select
only certain data, as well as the need to restructure the where clause to use indexes. In
extreme cases, the entire SQL statement may need to be re-written to become more
efficient.
Reviewing SQL Syntax
SQL Syntax can also have a great impact on query performance. Certain operators can
slow performance, for example:
EXISTS clauses are almost always used in correlated sub-queries. They are
executed for each row of the parent query and cannot take advantage of
indexes, while the IN clause is executed once and does use indexes, and may be
translated to a JOIN by the optimizer. If possible, replace EXISTS with an IN
clause. For example:
SELECT * FROM DEPARTMENTS WHERE DEPT_ID IN
(SELECT DISTINCT DEPT_ID FROM MANAGERS) -- Faster
SELECT * FROM DEPARTMENTS D WHERE EXISTS
(SELECT * FROM MANAGERS M WHERE M.DEPT_ID = D.DEPT_ID)
Situation | EXISTS | IN
Index supports the sub-query | Yes | Yes
No index to support the sub-query | No (table scan per parent row) | Yes (table scan once)
Sub-query returns many rows | Probably not | Yes
Sub-query returns one or a few rows | Yes | Yes
Most of the sub-query rows are eliminated by the parent query | No | Yes
Index in the parent matches the sub-query columns | Possibly not, since EXISTS cannot use the index | Yes, IN uses the index
Where possible, use the EXISTS clause instead of the INTERSECT clause. Simply
modifying the query in this way can improve performance by more than 100
percent (see the example after this list).
Where possible, limit the use of outer joins on tables. Remove the outer joins from
the query and create lookup objects within the mapping to fill in the optional
information.
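As an example of the EXISTS rewrite of an INTERSECT (assuming NAME_ID is unique in CUSTOMERS; otherwise add DISTINCT to preserve the INTERSECT semantics):
SELECT NAME_ID FROM CUSTOMERS
INTERSECT
SELECT NAME_ID FROM EMPLOYEES
can often be rewritten as:
SELECT C.NAME_ID FROM CUSTOMERS C
WHERE EXISTS
(SELECT 1 FROM EMPLOYEES E WHERE E.NAME_ID = C.NAME_ID)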

Choosing the Best Join Order
Place the smallest table first in the join order. This is often a staging table holding the
IDs identifying the data in the incremental ETL load.
Always put the small table column on the right side of the join. Use the driving table
first in the WHERE clause, and work from it outward. In other words, be consistent and
orderly about placing columns in the WHERE clause.
Outer joins limit the join order that the optimizer can use. Don't use them needlessly.
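One way to make the intended order explicit (reusing the staging and parameter tables from the earlier in-line view example; the Oracle ORDERED hint instructs the optimizer to join tables in the order they appear in the FROM clause) is:
SELECT /*+ ORDERED */ R.*
FROM NEW_GROUP_TMP TMP,
EXPERIMENT_PARAMETER R
WHERE R.EXPERIMENT_PARAMETERS_ID = TMP.EXPERIMENT_PARAMETERS_ID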
Anti-join with NOT IN, NOT EXISTS, MINUS or EXCEPT, OUTER JOIN

Avoid use of the NOT IN clause. This clause causes the database engine to perform
a full table scan. While this may not be a problem on small tables, it can become
a performance drain on large tables.
SELECT NAME_ID FROM CUSTOMERS
WHERE NAME_ID NOT IN
(SELECT NAME_ID FROM EMPLOYEES)
Avoid use of the NOT EXISTS clause. This clause is better than the NOT IN, but
still may cause a full table scan.
SELECT C.NAME_ID FROM CUSTOMERS C
WHERE NOT EXISTS
(SELECT * FROM EMPLOYEES E
WHERE C.NAME_ID = E.NAME_ID)
In Oracle, use the MINUS operator to do the anti-join, if possible. In DB2, use the
equivalent EXCEPT operator.
SELECT C.NAME_ID FROM CUSTOMERS C
MINUS
SELECT E.NAME_ID FROM EMPLOYEES E
Also consider using outer joins with IS NULL conditions for anti-joins.
SELECT C.NAME_ID FROM CUSTOMERS C, EMPLOYEES E
WHERE C.NAME_ID = E.NAME_ID (+)
AND C.NAME_ID IS NULL
Review the database SQL manuals to determine the cost benefits or liabilities of certain
SQL clauses as they may change based on the database engine.
In lookups from large tables, try to limit the rows returned to the set of rows
matching the set in the source qualifier. Add the WHERE clause conditions to the
lookup. For example, if the source qualifier selects sales orders entered into the
system since the previous load of the database, then, in the product information
lookup, only select the products that match the distinct product IDs in the
incremental sales orders.
Avoid range lookups. This is a SELECT that uses a BETWEEN in the WHERE clause
that uses values retrieved from a table as limits in the BETWEEN. Here is an
example:
SELECT
R.BATCH_TRACKING_NO,
R.SUPPLIER_DESC,
R.SUPPLIER_REG_NO,
R.SUPPLIER_REF_CODE,
R.GCW_LOAD_DATE
FROM CDS_SUPPLIER R,
(SELECT L.LOAD_DATE_PREV AS LOAD_DATE_PREV,
L.LOAD_DATE AS LOAD_DATE
FROM ETL_AUDIT_LOG L
WHERE L.LOAD_DATE_PREV IN
(SELECT MAX(Y.LOAD_DATE_PREV) AS LOAD_DATE_PREV
FROM ETL_AUDIT_LOG Y)
) Z
WHERE
R.LOAD_DATE BETWEEN Z.LOAD_DATE_PREV AND Z.LOAD_DATE
The work-around is to use an in-line view to get the lower range in the FROM clause
and join it to the main query that limits the higher date range in its where clause. Use
an ORDER BY the lower limit in the in-line view. This is likely to reduce the throughput
time from hours to seconds.
Here is the improved SQL:
SELECT
R.BATCH_TRACKING_NO,
R.SUPPLIER_DESC,
R.SUPPLIER_REG_NO,
R.SUPPLIER_REF_CODE,
R.LOAD_DATE
FROM
/* In-line view for lower limit */
(SELECT
R1.BATCH_TRACKING_NO,
R1.SUPPLIER_DESC,
R1.SUPPLIER_REG_NO,
R1.SUPPLIER_REF_CODE,
R1.LOAD_DATE
FROM CDS_SUPPLIER R1,
(SELECT MAX(Y.LOAD_DATE_PREV) AS LOAD_DATE_PREV
FROM ETL_AUDIT_LOG Y) Z
WHERE R1.LOAD_DATE >= Z.LOAD_DATE_PREV
ORDER BY R1.LOAD_DATE) R,
/* end in-line view for lower limit */
(SELECT MAX(D.LOAD_DATE) AS LOAD_DATE
FROM ETL_AUDIT_LOG D) A /* upper limit */
WHERE R.LOAD_DATE <= A.LOAD_DATE
Tuning System Architecture
Use the following steps to improve the performance of any system:
1. Establish performance boundaries (baseline).
2. Define performance objectives.
3. Develop a performance monitoring plan.
4. Execute the plan.
5. Analyze measurements to determine whether the results meet the objectives. If
objectives are met, consider reducing the number of measurements
because performance monitoring itself uses system resources. Otherwise
continue with Step 6.
6. Determine the major constraints in the system.
7. Decide where the team can afford to make trade-offs and which resources can
bear additional load.
8. Adjust the configuration of the system. If it is feasible to change more than one
tuning option, implement one at a time. If there are no options left at any level,
this indicates that the system has reached its limits and hardware upgrades may
be advisable.
9. Return to Step 4 and continue to monitor the system.
10. Return to Step 1.
11. Re-examine outlined objectives and indicators.
12. Refine monitoring and tuning strategy.

System Resources
The PowerCenter Server uses the following system resources:
CPU
Load Manager shared memory
DTM buffer memory
Cache memory
When tuning the system, evaluate the following considerations during the
implementation process.
Determine if the network is running at an optimal speed. Recommended best
practice is to minimize the number of network hops between the PowerCenter
Server and the databases.
Use multiple PowerCenter Servers on separate systems to potentially improve
session performance.
When all character data processed by the PowerCenter Server is US-ASCII or
EBCDIC, configure the PowerCenter Server for ASCII data movement mode. In
ASCII mode, the PowerCenter Server uses one byte to store each character. In
Unicode mode, the PowerCenter Server uses two bytes for each character, which
can potentially slow session performance.
Check hard disks on related machines. Slow disk access on source and target
databases, source and target file systems, as well as the PowerCenter Server
and repository machines can slow session performance.
When an operating system runs out of physical memory, it starts paging to disk to
free physical memory. Configure the physical memory for the PowerCenter
Server machine to minimize paging to disk. Increase system memory when
sessions use large cached lookups or sessions have many partitions.
In a multi-processor UNIX environment, the PowerCenter Server may use a large
amount of system resources. Use processor binding to control processor usage
by the PowerCenter Server.
In a Sun Solaris environment, use the psrset command to create and manage a
processor set. After creating a processor set, use the pbind command to bind
the PowerCenter Server to the processor set so that the processor set only runs
the PowerCenter Server. For details, see project system administrator and Sun
Solaris documentation.
In an HP-UX environment, use the Process Resource Manager utility to control CPU
usage in the system. The Process Resource Manager allocates minimum system
resources and uses a maximum cap of resources. For details, see project system
administrator and HP-UX documentation.
In an AIX environment, use the Workload Manager in AIX 5L to manage system
resources during peak demands. The Workload Manager can allocate resources
and manage CPU, memory, and disk I/O bandwidth. For details, see project
system administrator and AIX documentation.

Database Performance Features
Nearly everything is a trade-off in the physical database implementation. Work with the
DBA in determining which of the many available alternatives is the best implementation
choice for the particular database. The project team must have a thorough
understanding of the data, database, and desired use of the database by the end-user
community prior to beginning the physical implementation process. Evaluate the
following considerations during the implementation process.
Denormalization. The DBA can use denormalization to improve performance by
eliminating the constraints and primary key to foreign key relationships, and
also eliminating join tables.
Indexes. Proper indexing can significantly improve query response time. The
trade-off of heavy indexing is a degradation of the time required to load data
rows into the target tables. Carefully written pre-session scripts are
recommended to drop indexes before the load and rebuild them after the
load using post-session scripts (see the sketch after this list).
Constraints. Avoid constraints if possible and try to exploit integrity enforcement
through the use of incorporating that additional logic in the mappings.
Rollback and Temporary Segments. Rollback and temporary segments are
primarily used to store data for queries (temporary) and INSERTs and UPDATES
(rollback). The rollback area must be large enough to hold all the data prior to a
COMMIT. Proper sizing can be crucial to ensuring successful completion of load
sessions, particularly on initial loads.
OS Priority. The priority of background processes is an often-overlooked problem
that can be difficult to determine after the fact. DBAs must work with the
System Administrator to ensure all the database processes have the same
priority.
Striping. Database performance can be increased significantly by implementing
either RAID 0 (striping) or RAID 5 (pooled disk sharing) to improve disk I/O throughput.
Disk Controllers. Although expensive, striping and RAID 5 can be further
enhanced by separating the disk controllers.
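A minimal sketch of the pre- and post-session scripts mentioned in the Indexes item above (the index and table names are hypothetical):
/* pre-session script */
DROP INDEX idx_sales_fact_customer;
/* post-session script, after the load completes */
CREATE INDEX idx_sales_fact_customer ON sales_fact (customer_id);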


Understanding and Setting UNIX Resources for
PowerCenter Installations
Challenge
This Best Practice explains what UNIX resource limits are, and how to control and
manage them.
Description
UNIX systems impose per-process limits on resources such as processor usage,
memory, and file handles. Understanding and setting these resources correctly is
essential for PowerCenter installations.
Understanding UNIX Resource Limits
UNIX systems impose limits on several different resources. The resources that can be
limited depend on the actual operating system (e.g., Solaris, AIX, Linux, or HPUX) and
the version of the operating system. In general, all UNIX systems implement per-
process limits on the following resources. There may be additional resource limits
depending on the operating system.
Resource Description
Processor time The maximum amount of processor time that can be
used by a process, usually in seconds.
Maximum file size The size of the largest single file a process can create.
Usually specified in blocks of 512 bytes.
Process data The maximum amount of data memory a process can
allocate. Usually specified in KB.
Process stack The maximum amount of stack memory a process can
allocate. Usually specified in KB.
Number of open files The maximum number of files that can be open
simultaneously.
Total virtual memory The maximum amount of memory a process can use,
including stack, instructions, and data. Usually specified
in KB.
Core file size The maximum size of a core dump file. Usually specified
in blocks of 512 bytes.
These limits are implemented on an individual process basis. The limits are also
inherited by child processes when they are created.
In practice, this means that the resource limits are typically set at logon time, and
apply to all processes started from the login shell. In the case of PowerCenter, any
limits in effect before the pmserver is started will also apply to all sessions (pmdtm)
started from that server. Any limits in effect when the repserver is started will also
apply to all repagents started from that repserver.
When a process exceeds its resource limit, UNIX will fail the operation that caused the
limit to be exceeded. Depending on the limit that is reached, memory allocations will
fail, files can't be opened, and processes will be terminated when they exceed their
processor time.
Since PowerCenter sessions often use a large amount of processor time, open many
files, and can use large amounts of memory, it is important to set resource limits
correctly to prevent the operating system from limiting access to required resources,
while preventing problems.
Hard and Soft Limits
Each resource that can be limited actually allows two limits to be specified: a soft
limit and a hard limit. Hard and soft limits can be confusing.
From a practical point of view, the difference between hard and soft limits doesn't
matter to PowerCenter or any other process; the lower value is enforced when it is
reached, whether it is a hard or soft limit.
The difference between hard and soft limits really only matters when changing resource
limits. The hard limits are the absolute maximums set by the system administrator that
can only be changed by the system administrator. The soft limits are recommended
values set by the System Administrator, and can be increased by the user, up to the
maximum limits.
UNIX Resource Limit Commands
The standard interface to UNIX resource limits is the ulimit shell command. This
command displays and sets resource limits. The C shell implements a variation of this
command called limit, which has different syntax but the same functions.
ulimit -a Displays all soft limits
ulimit -H -a Displays all hard limits in effect
Recommended ulimit settings for a PowerCenter server:
Resource Description
Processor time Unlimited. This is needed for the pmserver and
pmrepserver that run forever.
Maximum file size Based on what's needed for the specific application. This
is an important parameter to keep a session from filling
a whole filesystem, but needs to be large enough to not
affect normal production operations.
Process data 1GB to 2GB
Process stack 32MB
Number of open files At least 256. Each network connection counts as a file
so source, target, and repository connections, as well as
cache files all use file handles.
Total virtual memory The largest expected size of a session. 1GB should be
adequate, unless sessions are expected to create large
in-memory aggregate and lookup caches that require
more memory than this.
Core file size Unlimited, unless disk space is very tight. The largest
core files could be ~2-3GB but after analysis they should
be deleted, and there really shouldn't be multiple core
files lying around.
Setting Resource Limits
Resource limits are normally set in the login script, either .profile for the Korn shell or
.bash_profile for the bash shell. One ulimit command is required for each resource
being set, and usually the soft limit is set. A typical sequence is:
ulimit -S -c unlimited
ulimit -S -d 1232896
ulimit -S -s 32768
ulimit -S -t unlimited
ulimit -S -f 2097152
ulimit -S -n 1024
ulimit -S -v unlimited
after running this, the limits are changed:
% ulimit -S -a
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) 1232896
file size (blocks, -f) 2097152
max memory size (kbytes, -m) unlimited
open files (-n) 1024
stack size (kbytes, -s) 32768
cpu time (seconds, -t) unlimited
virtual memory (kbytes, -v) unlimited
Setting or Changing Hard Resource Limits
Setting or changing hard resource limits varies across UNIX types. Most current UNIX
systems set the initial hard limits in the file /etc/profile, which must be changed by a
System Administrator. In some cases, it is necessary to run a system utility such as
smit on AIX to change the global system limits.

Upgrading PowerCenter
Challenge
Upgrading an existing version of PowerCenter to a later one encompasses upgrading
the repositories, implementing any necessary modifications, testing, and configuring
new features. The challenge here is to tackle the upgrade exercise in a structured
fashion and minimize risks to the repository and project work.
Some of the challenges typically encountered during an upgrade are:
Limiting the development downtime to a minimum
Ensuring that development work performed during the upgrade is accurately
migrated to the upgraded repository.
Ensuring that all the elements of all the various environments (e.g., Development,
Test, and Production) are upgraded.
Description
Some typical reasons for an upgrade include:
To take advantage of the new features in PowerCenter to enhance development
productivity and administration
To solve more business problems
To achieve data processing performance gains

Upgrade Team
Assembling a team of knowledgeable individuals to carry out the PowerCenter upgrade
is key to completing the process within schedule and budgetary guidelines. Typically,
the upgrade team needs the following key players:
PowerCenter Administrator
Database Administrator
System Administrator
Informatica team - the business and technical users that "own" the various areas
in the Informatica environment. These users are necessary for knowledge
transfer and to verify results after the upgrade is complete.

Upgrade Paths
The specific upgrade process depends on which of the existing PowerCenter versions
you are upgrading from and which version you are moving to. The following bullet
items summarize the upgrade paths for the various PowerCenter versions:
PowerCenter 7.0 (available since December 2003)
o Direct upgrade for PowerCenter 5.x to 7.x
o Direct upgrade for PowerCenter 6.x to 7.x
Other versions:
o For version 4.6 or earlier - upgrade to 5.x, and then to 7.x (or 6.x)
o For version 4.7 - upgrade to 5.x or 6.x, and then to 7.x
Upgrade Tips
Some of the following items may seem obvious, but adhering to these tips should help
to ensure that the upgrade process goes smoothly.
Be sure to have sufficient memory and disk space (database).
Remember that the version 7.x repository is 10 percent larger than the version 6.x
repository and as much as 35 percent larger than the version 5.x repository.
Always read the upgrade log file.
Backup Repository Server and PowerCenter Server configuration files prior to
beginning the upgrade process.
Remember that version 7.x uses the registry while version 6.x used win.ini - and plan
accordingly for the change.
Test the AEP/EP (Advanced External Procedure/External Procedure) prior to
beginning the upgrade. Recompiling may be necessary.
If PowerCenter is running on Windows, you will need another Windows-based
machine to setup a parallel Development environment since two servers cannot
run on the same Windows machine.
If PowerCenter is running on a UNIX platform, you can setup a parallel
Development environment in a different directory, with a different user and
modified profile.
Ensure that all repositories for upgrade are backed up and that they can be
restored successfully. Repositories can be restored to the same database in
different schemas to allow an upgrade to be carried out in parallel. This is
especially useful if PowerCenter test and development environments reside in a
single repository.
Upgrading multiple projects
Be sure to consider the following items if the upgrade involves multiple projects:
All projects sharing a repository must upgrade at the same time (test concurrently).
Projects using multiple repositories must all upgrade at the same time.
After upgrade, each project should undergo full regression testing.
Upgrade project plan
The full upgrade process from version 5.x to 7.x can be extremely time consuming for a
large development environment. Informatica strongly recommends developing a project
plan to track progress and inform managers and team members of the tasks that need
to be completed.
Scheduling the upgrade
When an upgrade is scheduled in conjunction with other development work, it is
prudent to have it occur within a separate test environment that mimics production.
This reduces the risk of unexpected errors and can decrease the effort spent on the
upgrade. It may also allow the development work to continue in parallel with the
upgrade effort, depending on the specific site setup.
Upgrade Process
Informatica recommends using the following approach to handle the challenges
inherent in an upgrade effort.
Choosing an appropriate environment
It is advisable to have three separate environments: one each for Development, Test,
and Production.
The Test environment is generally the best place to start the upgrade process since it is
likely to be the most similar to Production. If possible, select a test sandbox that
parallels production as closely as possible. This will enable you to carry out data
comparisons between PowerCenter versions. And, if you begin the upgrade process in a
test environment, development can continue without interruption. Your corporate
policies on development, test, and sandbox environments and the work that can or
cannot be done in them will determine the precise order for the upgrade and any
associated development changes. Note that if changes are required as a result of the
upgrade, they will need to be migrated to Production. Use the existing version to
backup the PowerCenter repository, then ensure that the backup works by restoring it
to a new schema in the repository database.
Alternatively, you can begin the upgrade process in the Development environment or
set up a parallel environment in which to start the effort. The decision to use or copy an
existing platform depends on the state of project work across all environments. If it is
not possible to set up a parallel environment, the upgrade may start in Development,
then progress to the Test and Production systems. However, using a parallel
environment is likely to minimize development downtime. The important thing is to
understand the upgrade process and your own business and technical requirements,
then adapt the approaches described in this document to one that suits your particular
situation.
Organizing the upgrade effort
Begin by evaluating the entire upgrade effort in terms of resources, time, and
environments. This includes training, availability of database, operating system and
PowerCenter administrator resources as well as time to do the upgrade and carry out
the necessary testing in all environments. Refer to the release notes to help identify
mappings and other repository objects that may need changes as a result of the
upgrade.
Provide detailed training for the Upgrade team to ensure that everyone directly involved
in the upgrade process understands the new version and is capable of using it for their
own development work and assisting others with the upgrade process.
Run regression tests for all components on the old version. If possible, store the results
so that you can use them for comparison purposes after the upgrade is complete.
Before you begin the upgrade, be sure to backup the repository and server caches,
scripts, logs, bad files, parameter files, source and target files, and external
procedures. Also be sure to copy backed-up server files to the new directories as the
upgrade progresses.
If you are working in a UNIX environment and have to use the same machine for
existing and upgrade versions, be sure to use separate users, directories and ensure
that profile path statements do not overlap between the new and old versions of
PowerCenter. For additional information, refer to the system manuals for path
statements and environment variables for your platform and operating system.
Installing and configuring the software
Install the new version of the PowerCenter components on the server.
Ensure that the PowerCenter client is installed on at least one workstation to be
used for upgrade testing and that connections to repositories are updated if
parallel repositories are being used.
Re-compile the AEP/EP if needed and test them
Configure and start the repository server, (ensure that licensing keys are entered
in the Repository Server Administration Console for version 7.x.x).
Upgrade the repository using the Repository Manager, Repository Server
Administration Console, or the pmrepagent command depending on the version
you are upgrading to. Note that the Repository Server needs to be started in
order to run the upgrade.
Configure the server details in the Workflow Manager and complete the server
configuration on the server (including license keys for version 7.x.x).
Start the PowerCenter server pmserver on UNIX or the Informatica service on a
Microsoft Windows operating system.
Analyze upgrade activity logs to identify areas where changes may be required,
rerun full regression tests on the upgraded repository.
Run through test plans. Ensure that there are no failures and all the loads run
successfully in the upgraded environment.
Verify the data to ensure that there are no changes and no additional or missing
records.
Implementing changes and testing
If changes are needed, decide where those changes are going to be made. It is
generally advisable to migrate work back from test to an upgraded development
environment. Complete the necessary changes, then migrate forward through test to
production. Assess the changes when the results from the test runs are available. If you
decide to deviate from best practice and make changes in test and migrate them
forward to production, remember that you'll still need to implement the changes in
development. Otherwise, these changes will be re-identified the next time work is
migrated to the test environment.
When you are satisfied with the results of testing, upgrade the other environments by
backing up and restoring the appropriate repositories. Be sure to closely monitor the
Production environment and check the results after the upgrade. Also remember to
archive and remove old repositories from the previous version.
After the Upgrade
Make sure Use Repository Privilege is assigned properly.
Create a team-based environment with deployment groups, labels and/or queries.
Create a server grid to test performance gains.
Start measuring data quality by creating a sample data profile.
If LDAP is in use, associate LDAP users with PowerCenter users.
Install PowerCenter Metadata Reporter and configure the built-in reports for the
PowerCenter repository.
Repository versioning
After upgrading to version 7, you can set the repository to versioned or non-versioned
if the Team-Based Management option has been purchased and is enabled by the
license.
Once the repository is set to versioned, it cannot be set back to non-versioned.
Upgrading folder versions
After upgrading to version 7.x, you'll need to remember the following:
There are no more folder versions in version 7.
The folder with the highest version number becomes the current folder.
Other versions of the folders will be named folder_<folder_version_number>.
Shortcuts will be created to mappings from the current folder.
Upgrading repository privileges
Version 7 includes a repository privilege called Use Repository Manager, which
enables users to use new features incorporated in version 7. Users with Use Designer
and Use WFM get this new privilege.
Upgrading Pmrep and Pmcmd scripts
No more folder versions for pmrep and pmrepagent scripts
Need to make sure the workflow/session folder names match the upgraded names
Note that pmcmd command structure changes significantly after version 5. Version
5 pmcmd commands will still run in version 7 but are not guaranteed to be
backwards compatible in future versions.

Advanced external procedure transformations


AEPs are upgraded to Custom Transformation - a non-blocking transformation. To use this
feature, the procedure must be recompiled. The old DLL/library can be used when
recompilation is not required.
Upgrading XML definitions

Version 7 supports XML schema.
The upgrade removes namespaces and prefixes for multiple namespaces
Circular reference definitions are read-only after the upgrade
Some datatypes are changed in XML definitions by the upgrade
Upgrading transaction control mappings
Version 7 does not support concatenation of pipelines or branches with transaction
control transformations. After the upgrade, fix mappings and re-save.


Assessing the Business Case
Challenge
Developing a solid business case for the project that includes both the tangible and
intangible potential benefits of the project.
Description
The Business Case should include both qualitative and quantitative assessments of the
project.
The Qualitative Assessment portion of the Business Case is based on the Statement
of Problem/Need and the Statement of Project Goals and Objectives (both generated in
Subtask 1.1.1) and focuses on discussions with the project beneficiaries about the expected
benefits in terms of problem alleviation, cost savings or controls, and increased
efficiencies and opportunities.
The Quantitative Assessment portion of the Business Case provides specific
measurable details of the proposed project, such as the estimated ROI. This may
involve the following calculations:
Cash flow analysis - Projects positive and negative cash flows for the anticipated
life of the project. Typically, ROI measurements use the cash flow formula to
depict results.
Net present value - Evaluates cash flow according to the long-term value of
current investment. Net present value shows how much capital needs to be
invested currently, at an assumed interest rate, in order to create a stream of
payments over time. For instance, to generate an income stream of $500 per
month over six months at an interest rate of eight percent would require an
investment (i.e., a net present value) of $2,311.44.
Return on investment - Calculates the net present value of total incremental cost
savings and revenue divided by the net present value of total costs, multiplied by
100. This type of ROI calculation is frequently referred to as return on equity or
return on capital employed.
Payback Period - Determines how much time will pass before an initial capital
investment is recovered. (A small worked sketch of these calculations follows this list.)
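The arithmetic behind these measures is straightforward to check. The following is a
minimal sketch in Python; the discount rate and the benefit and cost figures are
illustrative assumptions rather than values from any real project, and a rate of eight
percent per period reproduces the $2,311.44 annuity figure cited above.

    # Illustrative only: the rate and cash flows below are assumed values.
    def npv(rate, cash_flows):
        # Net present value of per-period cash flows, discounted at 'rate' per period
        # (the first cash flow is one period out).
        return sum(cf / (1 + rate) ** period
                   for period, cf in enumerate(cash_flows, start=1))

    # The annuity example above: $500 per month for six months at 8% per period.
    print(round(npv(0.08, [500] * 6), 2))              # 2311.44

    # ROI as defined above: NPV of incremental savings and revenue divided by
    # NPV of costs, multiplied by 100.
    benefits = [0, 40000, 60000, 80000]                # assumed per-period benefits
    costs    = [100000, 20000, 20000, 20000]           # assumed per-period costs
    roi_pct  = npv(0.08, benefits) / npv(0.08, costs) * 100
    print(round(roi_pct, 1))

    # Payback period: the first period in which cumulative net cash flow turns positive.
    def payback_period(net_flows):
        cumulative = 0
        for period, flow in enumerate(net_flows, start=1):
            cumulative += flow
            if cumulative >= 0:
                return period
        return None                                    # not recovered within the horizon

    print(payback_period([b - c for b, c in zip(benefits, costs)]))

The overall "bottom line" described in Step 7 below is then simply the difference between
npv(rate, benefits) and npv(rate, costs).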
The following are steps to calculate the quantitative business case or ROI:

Step 1. Develop Enterprise Deployment Map. This is a model of the project phases
over a timeline, estimating as specifically as possible participants, requirements, and
systems involved. A data integration initiative or amendment may require estimating
customer participation (e.g., by department and location), subject area and type of
information/analysis, numbers of users, numbers and complexity of target data
systems (data marts or operational databases, for example) and data sources, types of
sources, and size of data set. A data migration project may require estimating customer
participation, legacy system migrations, and retirement procedures. The types of
estimates vary by project type and goals. It is important to note that the more detailed
your estimates, the more precise your phased solutions will be. The scope of the project
should also be made clear in the deployment map.
Step 2. Analyze Potential Benefits. Discussions with representative managers and
users or the Project Sponsor should reveal the tangible and intangible benefits of the
project. The most effective format for presenting this analysis is often a "before" and
"after" comparison of the current situation to the project expectations. Also include in
this step any costs that will be avoided as a result of deploying the project.
Step 3. Calculate Net Present Value for all Benefits. Information gathered in this
step should help the customer representatives to understand how the expected benefits
will be allocated throughout the organization over time, using the enterprise
deployment map as a guide.
Step 4. Define Overall Costs. Customers need specific cost information in order to
assess the dollar impact of the project. Cost estimates should address the following
fundamental cost components:
Hardware
Networks
RDBMS software
Back-end tools
Query/reporting tools
Internal labor
External labor
Ongoing support
Training
Step 5. Calculate Net Present Value for all Costs. Use either actual cost estimates
or percentage-of-cost values (based on cost allocation assumptions) to calculate costs
for each cost component, projected over the timeline of the enterprise deployment
map. Actual cost estimates are more accurate than percentage-of-cost allocations, but
much more time-consuming. The percentage-of-cost allocation process may be valuable
for initial ROI snapshots until costs can be more clearly predicted.
Step 6. Assess Risk, Adjust Costs and Benefits Accordingly. Review potential
risks to the project and make corresponding adjustments to the costs and/or benefits.
Some of the major risks to consider are:
Scope creep, which can be mitigated by thorough planning and tight project scope
Integration complexity, which may be reduced by standardizing on vendors with
integrated product sets or open architectures
An inappropriate architectural strategy
Current support infrastructure may not meet the needs of the project
Conflicting priorities may impact resource availability
Other miscellaneous risks from management or end users who may withhold
project support; from the entanglements of internal politics; and from
technologies that don't function as promised
Unexpected data quality, complexity, or definition issues often are discovered late,
during the course of the project, and can adversely affect effort, cost and
schedule. This can be somewhat mitigated by early source analysis.
Step 7. Determine Overall ROI. When all other portions of the business case are
complete, calculate the project's "bottom line". Determining the overall ROI is simply a
matter of subtracting the net present value of total costs from the net present value of
(total incremental revenue plus cost savings).


Defining and Prioritizing Requirements
Challenge
Defining and prioritizing business and functional requirements is often accomplished
through a combination of interviews and facilitated meetings (i.e., workshops) between
the Project Sponsor and beneficiaries and the Project Manager and Business Analyst.
Description
The following three steps are key for successfully defining and prioritizing
requirements:
Step 1: Discovery
Gathering business requirements is one of the most important stages of any data
integration project. Business requirements affect virtually every aspect of the data
integration project, from Project Planning and Management through End-User
Application Specification. They are like a hub that sits in the middle and touches the
various stages (spokes) of the data integration project. There are two basic techniques
for gathering requirements and investigating the underlying operational data:
interviews and facilitated sessions.
Interviews
It is important to conduct pre-interview research before starting the requirements
gathering process. Interviewees can be categorized into business management and
Information Systems (IS) management.
Business Interviewees: Depending on the needs of the project, and even if you are
focused on a single primary business area, it is always beneficial to interview
horizontally to get a good cross-functional perspective of the enterprise. This also
provides insight into how extensible your project is across the enterprise. Before you
interview, be sure to develop an interview questionnaire, schedule the interview time
and place, and prepare the interviewees by sending a sample agenda. When interviewing
business people it is always important to start with the upper echelons of management
so as to understand the overall vision, assuming you have the business background,
confidence and credibility to converse at those levels. If not adequately prepared, the
safer approach is to interview middle management. If you are interviewing across
multiple teams, you might want to scramble the interviews among teams. That way, if you
hear different perspectives from finance and marketing, you can use the scrambled
schedule to resolve the discrepancies. Keep in mind that the business is sponsoring the
data integration project and its members will be the end-users of the application. They
will decide the success criteria of your data integration project and
determine future sponsorship. Questioning during these sessions should include the
following:
What are the target business functions, roles, and responsibilities?
What are the key relevant business strategies, decisions, and processes (in brief)?
What information is important to drive, support, and measure success for those
strategies/processes? What key metrics? What dimensions for those metrics?
What current reporting and analysis is applicable? Who provides it? How is it
presented? How is it used? How can it be improved?
IS interviewees: The IS interviewees have a different flavor than the business user
community. Interviewing the IS team is generally very beneficial because it is
composed of data gurus who deal with the data on a daily basis. They can provide great
insight into data quality issues, help in systematically exploring legacy source systems,
and help in understanding business users' needs around critical reports. If you are developing a
prototype, they can help get things done quickly and address important business
reports. Questioning during these sessions should include the following:
Request an overview of existing legacy source systems. How does data currently
flow from these systems to the users?
What day-to-day maintenance issues does the operations team encounter with
these systems?
Ask for their insight into data quality issues.
What business users do they support? What reports are generated on a daily,
weekly, or monthly basis? What are the current service level agreements for
these reports?
How can the DI project support the IS department needs?
Facilitated Sessions
The biggest advantage of facilitated sessions is that they provide quick feedback by
gathering all the people from the various teams into a meeting and initiating the
requirements process. You need a facilitator in these meetings to ensure that all the
participants get a chance to speak and provide feedback. During individual (or small
group) interviews with high-level management, there is often focus and clarity of vision
that may be hindered in large meetings.
The biggest challenge to facilitated sessions is matching everyone's busy schedules and
actually getting them into a meeting room. However, this part of the process must be
focused and brief or it can become unwieldy with too much time expended just trying to
coordinate calendars among worthy forum participants. Set a time period and target list
of participants with the Project Sponsor, but avoid lengthening the process if some
participants aren't available. The questions asked during facilitated sessions are similar
to the questions asked to business and IS interviewees.
Step 2: Validation and Prioritization

The Business Analyst, with the help of the Project Architect, documents the findings of
the discovery process after interviewing the business and IS management. The next
step is to define the business requirements specification. The resulting Business
Requirements Specification includes a matrix linking the specific business requirements
to their functional requirements. Defining the business requirements is a time
consuming process and should be facilitated by forming a working group team. A
working group team usually consists of business users, business analysts, project
manager, and other individuals who can help to define the business requirements. The
working group should meet weekly to define and finalize business requirements. The
working group helps to:
Design the current state and future state
Identify supply format and transport mechanism
Identify required message types
Develop Service Level Agreement(s), including timings
Identify supply management and control requirements
Identify common verifications, validations, business validations and transformation
rules
Identify common reference data requirements
Identify common exceptions
Produce the physical message specification
At this time also, the Architect develops the Information Requirements Specification to
clearly represent the structure of the information requirements. This document, based
on the business requirements findings, will facilitate discussion of informational details
and provide the starting point for the target model definition.
The detailed business requirements and information requirements should be reviewed
with the project beneficiaries and prioritized based on business need and the stated
project objectives and scope.
Step 3: The Incremental Roadmap
Concurrent with the validation of the business requirements, the Architect begins the
Functional Requirements Specification providing details on the technical requirements
for the project.
As general technical feasibility is compared to the prioritization from Step 2, the Project
Manager, Business Analyst, and Architect develop consensus on a project "phasing"
approach. Items of secondary priority and those with poor near-term feasibility are
relegated to subsequent phases of the project. Thus, they develop a phased, or
incremental, "roadmap" for the project (Project Roadmap).
This is presented to the Project Sponsor for approval and becomes the first "Increment"
or starting point for the Project Plan.


Developing a Work Breakdown Structure (WBS)
Challenge
Developing a comprehensive work breakdown structure (WBS) that clearly depicts all of
the various tasks and subtasks required to complete a project. Because project time
and resource estimates are typically based on the WBS, it is critical to develop a
thorough, accurate WBS.
Description
The WBS is a divide-and-conquer approach to project management. It is a hierarchical
tree that allows large tasks to be visualized as a group of related smaller, more
manageable sub-tasks. These tasks can be more easily monitored and communicated;
they also make identifying accountability a more direct and clear process. The WBS
serves as a starting point for both the project estimate and the project plan.
One challenge in developing a thorough WBS is obtaining the correct balance between
enough detail, and too much detail. The WBS shouldn't be a 'grocery list' of every minor
detail in the project, but it does need to break the tasks down to a manageable level of
detail. One general guideline is to keep task detail to a duration of at least a day. Also,
when naming these tasks, take care that all organizations participating in the
project understand how tasks are broken down. If department A typically breaks a
certain task up among three groups and department B assigns it to one, there can be
potential issues when tasks are assigned.
It is also important to remember that the WBS is not necessarily a sequential
document. Tasks in the hierarchy are often completed in parallel. At this stage of
project planning, the goal is to list every task that must be completed; it is not
necessary to determine the critical path for completing these tasks. For example, you
may have multiple subtasks under a task (e.g., 4.3.1 through 4.3.7 under task 4.3).
So, although subtasks 4.3.1 through 4.3.4 may have sequential requirements that force
you to complete them in order, subtasks 4.3.5 through 4.3.7 can - and should - be
completed in parallel if they do not have sequential requirements. However, it is
important to remember that a task is not complete until all of its corresponding
subtasks are completed - whether sequentially or in parallel. For example, the Build
phase is not complete until tasks 4.1 through 4.7 are complete, but some work can
(and should) begin for the Deploy phase long before the Build phase is complete.
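To make the hierarchy concrete, the following is a small hypothetical sketch in Python
(the task numbers, names, and day estimates are invented for illustration). It models a
WBS fragment as a tree and rolls the subtask estimates up to the parent task, reflecting
the point that a task such as 4.3 is only complete, and only fully estimated, when all of
its subtasks are accounted for.

    # Hypothetical WBS fragment: (WBS number, name, own effort in days, subtasks).
    wbs = ("4.3", "Build load mappings", 0, [
        ("4.3.1", "Design mapping logic", 2, []),
        ("4.3.2", "Build mapping",        3, []),
        ("4.3.3", "Unit test mapping",    2, []),
        ("4.3.4", "Peer review",          1, []),
    ])

    def rollup(task):
        # Total effort for a task is its own effort plus the effort of all its subtasks.
        number, name, effort, subtasks = task
        total = effort + sum(rollup(sub) for sub in subtasks)
        print(f"{number:<8} {name:<24} {total:>3} days")
        return total

    rollup(wbs)   # prints each subtask, then 4.3 with the rolled-up total of 8 days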
The Project Plan provides a starting point for further development of the project WBS.
This sample is a Microsoft Project file that has been "pre-loaded" with the phases,
tasks, and subtasks that make up the Informatica Methodology. The Project Manager
can use this WBS as a starting point, but should review it carefully to ensure that it
corresponds to the specific development effort, removing any steps that aren't relevant
or adding steps as necessary. Many projects require the addition of detailed steps to
accurately represent the development effort.
If the Project Manager chooses not to use Microsoft Project, an Excel version of the
Work Breakdown Structure is available. The phases, tasks, and subtasks can be
exported from Excel into many other project management tools, simplifying the effort
of developing the WBS.
After the WBS has been loaded into the selected project management tool and refined
for the specific project needs, the Project Manager can begin to estimate the level of
effort involved in completing each of the steps. When the estimate is complete,
individual resources can be assigned and scheduled. The end result is the Project Plan.
Refer to Developing and Maintaining the Project Plan for further information about the
project plan.


Developing and Maintaining the Project Plan
Challenge
Developing the first-pass of a project plan that incorporates all of the necessary
components but which is sufficiently flexible to accept the inevitable changes.
Description
Use the following steps as a guide for developing the initial project plan:
1. Define the project's major milestones based on the Project Scope.
2. Break the milestones down into major tasks and activities. The Project
Plan should be helpful as a starting point or for recommending tasks for
inclusion.
3. Continue the detail breakdown, if possible, to a level at which tasks are of about
one to three days' duration. This level provides satisfactory detail to facilitate
estimation and tracking. If the detail tasks are too broad in scope, estimates are
much less likely to be accurate.
4. Confer with technical personnel to review the task definitions and effort
estimates (or even to help define them, if applicable).
5. Establish the dependencies among tasks, where one task cannot be started until
another is completed (or must start or complete concurrently with another).
6. Define the resources based on the role definitions and estimated number of
resources needed for each role.
7. Assign resources to each task. If a resource will only be part-time on a task,
indicate this in the plan.
At this point, especially when using Microsoft Project, it is advisable to create
dependencies (i.e., predecessor relationships) between tasks assigned to the same
resource in order to indicate the sequence of that person's activities.
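As a simple illustration of how the duration estimates from step 3 and the dependencies
from step 5 combine into a schedule, the hypothetical Python sketch below computes the
earliest finish day for each task. The tasks, durations, and dependencies are invented,
and a tool such as Microsoft Project performs this calculation for you; this is only
meant to show the mechanics.

    # Hypothetical tasks: name -> (duration in days, list of predecessor tasks).
    tasks = {
        "Define milestones":   (2, []),
        "Break down tasks":    (5, ["Define milestones"]),
        "Define resources":    (3, ["Define milestones"]),
        "Assign resources":    (2, ["Break down tasks", "Define resources"]),
        "Review with sponsor": (1, ["Assign resources"]),
    }

    finish = {}   # earliest finish day computed for each task

    def earliest_finish(name):
        # Earliest finish = latest finish among predecessors + the task's own duration.
        if name not in finish:
            duration, predecessors = tasks[name]
            start = max((earliest_finish(p) for p in predecessors), default=0)
            finish[name] = start + duration
        return finish[name]

    for name in tasks:
        print(f"{name:<20} finishes on day {earliest_finish(name)}")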
The initial definition of tasks and effort and the resulting schedule should be an exercise
in pragmatic feasibility unfettered by concerns about ideal completion dates. In other
words, be as realistic as possible in your initial estimations, even if the resulting
scheduling is likely to be a hard sell to company management.
This initial schedule becomes a starting point. Expect to review and rework it, perhaps
several times. Look for opportunities for parallel activities, perhaps adding resources, if
necessary, to improve the schedule.

When a satisfactory initial plan is complete, review it with the Project Sponsor and
discuss the assumptions, dependencies, assignments, milestone dates, and such.
Expect to modify the plan as a result of this review.
Reviewing and Revising the Project Plan
Once the Project Sponsor and company managers agree to the initial plan, it becomes
the basis for assigning tasks to individuals on the project team and for setting
expectations regarding delivery dates. The planning activity then shifts to tracking tasks
against the schedule and updating the plan based on status and changes to
assumptions.
One approach is to establish a baseline schedule (and budget, if applicable) and then
track changes against it. With Microsoft Project, this involves creating a "Baseline" that
remains static as changes are applied to the schedule. If company and project
management do not require tracking against a baseline, simply maintain the plan
through updates without a baseline.
Regular status reporting should include any changes to the schedule, beginning with
team members' notification that dates for task completions are likely to change or have
already been exceeded. These status report updates should trigger a regular plan
update so that project management can track the effect on the overall schedule and
budget.
Be sure to evaluate any changes to scope (see 1.2.4 Manage Project and Scope Change
Assessment Sample Deliverable.), or changes in priority or approach, as they arise to
determine if they impact the plan. It may be necessary to modify the plan if changes in
scope or priority require rearranging task assignments or delivery sequences, or if they
add new tasks or postpone existing ones.


Developing the Business Case
Challenge
Identifying the departments and individuals that are likely to benefit directly from the
project implementation. Understanding these individuals, and their business information
requirements, is key to defining and scoping the project.
Description
The following four steps summarize business case development and lay a good
foundation for proceeding into detailed business requirements for the project.
1. One of the first steps in establishing the business scope is identifying the project
beneficiaries and understanding their business roles and project participation. In many
cases, the Project Sponsor can help to identify the beneficiaries and the various
departments they represent. This information can then be summarized in an
organization chart that is useful for ensuring that all project team members understand
the corporate/business organization.
Activity - Interview project sponsor to identify beneficiaries, define their business
roles and project participation.
Deliverable - Organization chart of corporate beneficiaries and participants.
2. The next step in establishing the business scope is to understand the business
problem or need that the project addresses. This information should be clearly defined
in a Problem/Needs Statement, using business terms to describe the problem. For
example, the problem may be expressed as "a lack of information" rather than "a lack
of technology" and should detail the business decisions or analysis that is required to
resolve the lack of information. The best way to gather this type of information is by
interviewing the Project Sponsor and/or the project beneficiaries.
Activity - Interview (individually or in forum) Project Sponsor and/or beneficiaries
regarding problems and needs related to project.
Deliverable - Problem/Need Statement
3. The next step in creating the project scope is defining the business goals and
objectives for the project and detailing them in a comprehensive Statement of Project
Goals and Objectives. This statement should be a high-level expression of the desired
business solution (e.g., what strategic or tactical benefits does the business expect to
gain from the project?) and should avoid any technical considerations at this point.
Again, the Project Sponsor and beneficiaries are the best sources for this type of
information. It may be practical to combine information gathering for the needs
assessment and goals definition, using individual interviews or general meetings to
elicit the information.
Activity - Interview (individually or in forum) Project Sponsor and/or beneficiaries
regarding business goals and objectives for the project.
Deliverable - Statement of Project Goals and Objectives
4. The final step is creating a Project Scope and Assumptions statement that clearly
defines the boundaries of the project based on the Statement of Project Goals and
Objective and the associated project assumptions. This statement should focus on the
type of information or analysis that will be included in the project rather than what will
not.
The assumptions statements are optional and may include qualifiers on the scope, such
as assumptions of feasibility, specific roles and responsibilities, or availability of
resources or data.
Activity -Business Analyst develops Project Scope and Assumptions statement for
presentation to the Project Sponsor.
Deliverable - Project Scope and Assumptions statement


Managing the Project Lifecycle
Challenge
Providing a structure for on-going management throughout the project lifecycle.
Description
It is important to remember that the quality of a project can be directly correlated to
the amount of review that occurs during its lifecycle.
Project Status and Plan Reviews
In addition to the initial project plan review with the Project Sponsor, schedule regular
status meetings with the sponsor and project team to review status, issues, scope
changes and schedule updates.
Gather status, issues, and schedule update information from the team one day before
the status meeting in order to compile and distribute the Status Report.
Project Content Reviews
The Project Manager should coordinate, if not facilitate, reviews of requirements, plans
and deliverables with company management, including business requirements reviews
with business personnel and technical reviews with project technical personnel.
Set a process in place beforehand to ensure appropriate personnel are invited, any
relevant documents are distributed at least 24 hours in advance, and that reviews focus
on questions and issues (rather than a laborious "reading of the code").
Reviews may include:
Project scope and business case review
Business requirements review
Source analysis and business rules reviews
Data architecture review
Technical infrastructure review (hardware and software capacity and configuration
planning)
Data integration logic review (source to target mappings, cleansing and
transformation logic, etc.)
Source extraction process review
Operations review (operations and maintenance of load sessions, etc.)
Reviews of operations plan, QA plan, deployment and support plan

Change Management
Directly address and evaluate any changes to the planned project activities, priorities,
or staffing as they arise, or are proposed, in terms of their impact on the project plan.
Use the Scope Change Assessment to record the background problem or
requirement and the recommended resolution that constitutes the potential
scope change.
Review each potential change with the technical team to assess its impact on the
project, evaluating the effect in terms of schedule, budget, staffing
requirements, and so forth.
Present the Scope Change Assessment to the Project Sponsor for acceptance (with
formal sign-off, if applicable). Discuss the assumptions involved in the impact
estimate and any potential risks to the project.
The Project Manager should institute this type of change management process in
response to any issue or request that appears to add or alter expected activities and
has the potential to affect the plan. Even if there is no evident effect on the schedule, it
is important to document these changes because they may affect project direction and
it may become necessary, later in the project cycle, to justify these changes to
management.
Issues Management
Any questions, problems, or issues that arise and are not immediately resolved should
be tracked to ensure that someone is accountable for resolving them so that their effect
can also be visible.
Use the Issues Tracking template, or something similar, to track issues, their owner,
and dates of entry and resolution as well as the details of the issue and of its solution.
Significant or "showstopper" issues should also be mentioned on the status report.
Project Acceptance and Close
Rather than simply walking away from a project when it seems complete, there should
be an explicit close procedure. For most projects this involves a meeting where the
Project Sponsor and/or department managers acknowledge completion or sign a
statement of satisfactory completion.
Even for relatively short projects, use the Project Close Report to finalize the
project with a final status report detailing:
o What was accomplished
o Any justification for tasks expected but not completed
o Recommendations
Prepare for the close by considering what the project team has learned about the
environments, procedures, data integration design, data architecture, and other
project plans.
Formulate the recommendations based on issues or problems that need to be
addressed. Succinctly describe each problem or recommendation and if
applicable, briefly describe a recommended approach.


Using Interviews to Determine Corporate Analytics
Requirements
Challenge
Data warehousing projects are usually initiated out of a business need for a certain type
of report (e.g., consistent reporting of revenue, bookings, and backlog).
Except in the case of narrowly-focused, departmental data marts, however, this is not
enough guidance to drive a full analytic solution. Further, a successful, single-purpose
data mart can build a reputation such that, after a relatively brief period of proving its
value to users, business management floods the technical group with requests for more
data marts in other areas. The only way to avoid silos of data marts is to think bigger
at the beginning and canvass the enterprise (or at least the department, if that's your
limit of scope) for a broad analysis of analytic requirements.
Description
Determining the analytic requirements in satisfactory detail and clarity is a difficult
task, however, especially while ensuring that the requirements are representative of all
the potential stakeholders. This Best Practice summarizes the recommended interview and
prioritization process for this requirements analysis.
Process Steps
The first step in the process is to identify and interview all major sponsors and
stakeholders. This typically includes the executive staff and CFO since they are likely to
be the key decision makers who will depend on the analytics. At a minimum, figure on
10 to 20 interview sessions.
The next step in the process is to interview representative information providers. These
individuals include the decision makers who provide the strategic perspective on what
information to pursue, as well as details on that information, and how it is currently
used (i.e., reported and/or analyzed). Be sure to provide feedback to all of the sponsors
and stakeholders regarding the findings of the interviews and the recommended subject
areas and information profiles. It is often helpful to facilitate a Prioritization Workshop
with the major stakeholders, sponsors, and information providers in order to set
priorities on the subject areas.
Conduct Interviews

The following paragraphs offer some tips on the actual interviewing process. Two
sections at the end of this document provide sample interview outlines for the executive
staff and information providers.
Remember to keep executive interviews brief (i.e., an hour or less) and to the point. A
focused, consistent interview format is desirable. Don't feel bound to the script,
however, since interviewees are likely to raise some interesting points that may not be
included in the original interview format. Pursue these subjects as they come up, asking
detailed questions. This approach often leads to discoveries of strategic uses for
information that may be exciting to the client and provide sparkle and focus to the
project.
Questions to the executives or decision-makers should focus on what business
strategies and decisions need information to support or monitor them. (Refer to Outline
for Executive Interviews at the end of this document.) Coverage here is critical: if key
managers are left out, you may miss a critical viewpoint and an important
buy-in.
Interviews of information providers are secondary but can be very useful. These are the
business analyst-types who report to decision-makers and currently provide reports and
analyses using Excel or Lotus or a database program to consolidate data from more
than one source and provide regular and ad hoc reports or conduct sophisticated
analysis. In subsequent phases of the project, you must identify all of these
individuals, learn what information they access, and how they process it. At this stage
however, you should focus on the basics, building a foundation for the project and
discovering what tools are currently in use and where gaps may exist in the analysis
and reporting functions.
Be sure to take detailed notes throughout the interview process. If there are a lot of
interviews, you may want the interviewer to partner with someone who can take good
notes, perhaps on a laptop to save note transcription time later. It is important to take
down the details of what each person says because, at this stage, it is difficult to know
what is likely to be important. While some interviewees may want to see detailed notes
from their interviews, this is not very efficient since it takes time to clean up the notes
for review. The most efficient approach is to simply consolidate the interview notes into
a summary format following the interviews.
Be sure to review previous interviews as you go through the interviewing process. You
can often use information from earlier interviews to pursue topics in later interviews in
more detail and with varying perspectives.
The executive interviews must be carried out in business terms. There can be no
mention of the data warehouse, systems of record, particular source data entities,
or issues related to sourcing, cleansing, or transformation; it is strictly forbidden to
use any technical language. It can be valuable to have an industry expert prepare and
even accompany the interviewer to provide business terminology and focus. If the
interview falls into technical details, for example, into a discussion of whether certain
information is currently available or could be integrated into the data warehouse, it is
up to the interviewer to re-focus immediately on business needs. If this focus is not
maintained, the opportunity for brainstorming is likely to be lost, which will reduce the
quality and breadth of the business drivers.

Because of the above caution, it is rarely acceptable to have IS resources present at
the executive interviews. These resources are likely to engage the executive (or vice
versa) in a discussion of current reporting problems or technical issues and thereby
destroy the interview opportunity.
Keep the interview groups small. One or two Professional Services personnel should
suffice with at most one client project person. Especially for executive interviews, there
should be one interviewee. There is sometimes a need to interview a group of middle
managers together, but if there are more than two or three, you are likely to get much
less input from the participants.
Distribute Interview Findings and Recommended Subject Areas
At the completion of the interviews, compile the interview notes and consolidate the
content into a summary. This summary should help to break out the input into
departments or other groupings significant to the client. Use this content and your
interview experience along with best practices or industry experience to recommend
specific, well-defined subject areas.
Remember that this is a critical opportunity to position the project to the decision-
makers by accurately representing their interests while adding enough creativity to
capture their imagination. Provide them with models or profiles of the sort of
information that could be included in a subject area so they can visualize its utility. This
sort of visionary concept of their strategic information needs is crucial to drive their
awareness and is often suggested during interviews of the more strategic thinkers. Tie
descriptions of the information directly to stated business drivers (e.g., key processes
and decisions) to further accentuate the business solution.
A typical table of contents in the initial Findings and Recommendations document might
look like this:
I. Introduction
II. Executive Summary
A. Objectives for the Data Warehouse
B. Summary of Requirements
C. High Priority Information Categories
D. Issues
III. Recommendations
A. Strategic Information Requirements
B. Issues Related to Availability of Data
C. Suggested Initial Increments
D. Data Warehouse Model
IV. Summary of Findings
A. Description of Process Used
B. Key Business Strategies (this includes descriptions of processes, decisions, and
other drivers)
C. Key Departmental Strategies and Measurements
D. Existing Sources of Information
E. How Information is Used
F. Issues Related to Information Access
V. Appendices
A. Organizational structure, departmental roles
B. Departmental responsibilities, and relationships

Conduct Prioritization Workshop
This is a critical workshop for consensus on the business drivers. Key executives and
decision-makers should attend, along with some key information providers. It is
advisable to schedule this workshop offsite to assure attendance and attention, but the
workshop must be efficient, typically confined to a half-day.
Be sure to announce the workshop well enough in advance to ensure that key
attendees can put it on their schedules. Sending the announcement of the workshop
may coincide with the initial distribution of the interview findings.
The workshop agenda should include the following items:
Agenda and Introductions
Project Background and Objectives
Validate Interview Findings: Key Issues
Validate Information Needs
Reality Check: Feasibility
Prioritize Information Needs
Analytics Plan
Wrap-up and Next Steps
Keep the presentation as simple and concise as possible, and avoid technical
discussions or detailed sidetracks.
Validate information needs
Key business drivers should be determined well in advance of the workshop, using
information gathered during the interviewing process. Prior to the workshop, these
business drivers should be written out, preferably in display format on flipcharts or
similar presentation media, along with relevant comments or additions from the
interviewees and/or workshop attendees.
During the validation segment of the workshop, attendees need to review and discuss
the specific types of information that have been identified as important for triggering or
monitoring the business drivers. At this point, it is advisable to compile as complete a
list as possible; it can be refined and prioritized in subsequent phases of the project.
As much as possible, categorize the information needs by function, maybe even by
specific driver (i.e., a strategic process or decision). Considering the information needs
on a function by function basis fosters discussion of how the information is used and by
whom.
Reality check: feasibility
With the results of brainstorming over business drivers and information needs listed (all
over the walls, presumably), take a brief detour into reality before prioritizing and
planning. You need to consider overall feasibility before establishing the first priority
information area(s) and setting a plan to implement the data warehousing solution with
initial increments to address those first priorities.
Briefly describe the current state of the likely information sources (SORs). What
information is currently accessible with a reasonable likelihood of the quality and
content necessary for the high priority information areas? If there is likely to be a high
degree of complexity or technical difficulty in obtaining the source information, you may
need to reduce the priority of that information area (i.e., tackle it after some successes
in other areas).
Avoid getting into too much detail or technical issues. Describe the general types of
information that will be needed (e.g., sales revenue, service costs, customer descriptive
information, etc.), focusing on what you expect will be needed for the highest priority
information needs.
Analytics plan
The project sponsors, stakeholders, and users should all understand that the process of
implementing the data warehousing solution is incremental. Develop a high-level plan
for implementing the project, focusing on increments that are both high-value and
high-feasibility. Implementing these increments first provides an opportunity to build
credibility for the project. The objective during this step is to obtain buy-in for your
implementation plan and to begin to set expectations in terms of timing. Be practical
though; don't establish too rigorous a timeline!
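One lightweight way to support this prioritization (purely illustrative; the subject areas
and the 1-to-5 scores below are invented, and any scoring scheme the stakeholders agree on
would serve) is to rank candidate subject areas by the product of their business value and
feasibility ratings, as in the following Python sketch.

    # Hypothetical subject areas with workshop scores from 1 (low) to 5 (high).
    subject_areas = [
        # (subject area, business value, feasibility)
        ("Revenue and bookings",   5, 4),
        ("Customer service costs", 4, 2),
        ("Supplier performance",   3, 3),
    ]

    # Rank by the product of value and feasibility; high-value, high-feasibility
    # areas are the natural candidates for the first increment.
    ranked = sorted(subject_areas, key=lambda area: area[1] * area[2], reverse=True)

    for name, value, feasibility in ranked:
        print(f"{name:<24} value={value} feasibility={feasibility} score={value * feasibility}")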
Wrap-up and next steps
At the close of the workshop, review the group's decisions (in 30 seconds or less),
schedule the delivery of notes and findings to the attendees, and discuss the next steps
of the data warehousing project.
Document the Roadmap
As soon as possible after the workshop, provide the attendees and other project
stakeholders with the results:
Definitions of each subject area, categorized by functional area
Within each subject area, descriptions of the business drivers and information
metrics
Lists of the feasibility issues
The subject area priorities and the implementation timeline.
Outline for Executive Interviews
I. Introductions
II. General description of information strategy process
A. Purpose and goals
B. Overview of steps and deliverables
Interviews to understand business information strategies and
expectations
Document strategy findings
Consensus-building meeting to prioritize information requirements
and identify quick hits
Model strategic subject areas
Produce multi-phase Business Intelligence strategy
III. Goals for this meeting
A. Description of business vision, strategies
B. Perspective on strategic business issues and how they drive information
needs
Information needed to support or achieve business goals
How success is measured
IV. Briefly describe your roles and responsibilities?
The interviewee may provide this information before the actual interview. In this
case, simply review with the interviewee and ask if there is anything to add.
A. What are your key business strategies and objectives?
How do corporate strategic initiatives impact your group?
These may include MBOs (personal performance objectives), and
workgroup objectives or strategies.
B. What do you see as the Critical Success Factors for an Enterprise
Information Strategy?
What are its potential obstacles or pitfalls?
C. What information do you need to achieve or support key decisions related
to your business objectives?
D. How will your organization's progress and final success be measured
(e.g., metrics, critical success factors)?
E. What information or decisions from other groups affect your success?
F. What are other valuable information sources (i.e., computer reports,
industry reports, email, key people, meetings, phone)?
G. Do you have regular strategy meetings? What information is shared as
you develop your strategy?
H. If it is difficult for the interviewee to brainstorm about information needs,
try asking the question this way: "When you return from a two-week
vacation, what information do you want to know first?"
I. Of all the information you now receive, what is the most valuable?
J. What information do you need that is not now readily available?
K. How accurate is the information you are now getting?
L. To whom do you provide information?
M. Who provides information to you?
N. Who would you recommend be involved in the cross-functional
Consensus Workshop?
Outline for Information Provider Interviews

I. Introductions
II. General description of information strategy process
A. Purpose and goals
B. Overview of steps and deliverables
Interviews to understand business information strategies and
expectations
Document strategy findings and model the strategic subject areas
Consensus-building meeting to prioritize information requirements
and identify quick hits
Produce multi-phase Business Intelligence strategy
III. Goals for this meeting
A. Understanding of how business issues drive information needs
B. High-level understanding of what information is currently provided to whom
Where does it come from
How is it processed
What are its quality or access issues
IV. Briefly describe your roles and responsibilities?
The interviewee may provide this information before the actual interview. In this
case, simply review with the interviewee and ask if there is anything to add.
A. Who do you provide information to?
B. What information do you provide to help support or measure the
progress/success of their key business decisions?
C. Of all the information you now provide, what is the most requested or
most widely used?
D. What are your sources for the information (both in terms of systems and
personnel)?
E. What types of analysis do you regularly perform (i.e., trends,
investigating problems)? How do you provide these analyses (e.g.,
charts, graphs, spreadsheets)?
F. How do you change/add value to the information?
G. Are there quality or usability problems with the information you work
with? How accurate is it?


PowerExchange Installation (for Mainframe)
Challenge
Installing and configuring PowerExchange on a mainframe, ensuring that the process is
both efficient and effective.
Description
PowerExchange installation is very straightforward and can generally be accomplished
in a timely fashion. When considering a PowerExchange installation, be sure that the
appropriate resources are available. These include, but are not limited to:
MVS systems operator
Appropriate database administrator; this depends on what (if any) databases are
going to be sources and/or targets (e.g., IMS, IDMS, etc.).
MVS Security resources
Be sure to follow these steps in sequence to successfully install
PowerExchange. Note that in this very typical scenario, the mainframe source data is
going to be pulled across to a server box.
1. Complete the PowerExchange pre-install checklist and obtain valid license keys.
2. Install PowerExchange on the mainframe.
3. Start the PowerExchange jobs/tasks on the mainframe.
4. Install the PowerExchange client (Navigator) on a workstation.
5. Test connectivity to the mainframe from the workstation.
6. Install PowerExchange on the UNIX/NT server.
7. Test connectivity to the mainframe from the server.
Complete the PowerExchange Pre-install Checklist and Obtain Valid License
Keys
This is a prerequisite. Reviewing the environment and recording the information in this
detailed checklist facilitates the PowerExchange install. The checklist can be found in
the Velocity appendix. Be sure to complete all relevant sections.
You will need a valid license key in order to run any of the PowerExchange components.
This is a 44-byte key that uses hyphens every 4 bytes. For example:
1234-ABCD-1234-EF01-5678-A9B2-E1E2-E3E4-A5F1
The key is not case-sensitive and uses hexadecimal digits and letters (0-9 and A-F).
Keys are valid for a specific time period and are also linked to an exact or generic
TCP/IP address. They also control access to certain databases and determine if the
PowerCenter Mover can be used. You cannot successfully install PowerExchange without
a valid key for all required components.
Note: When copying software from one machine to another, you may encounter license
key problems since the license key is IP specific. Be prepared to deal with this
eventuality, especially if you are going to a backup site for disaster recovery testing.
Install PowerExchange on the Mainframe
Step 1: Create a folder c:\Detail on the workstation. Copy file
DETAIL_V5xx\software\MVS\dtlosxx.v5xx from the PowerExchange CD to this
directory. Double-click the file to unzip its contents to the c:\Detail folder.

Step 2: Create a PDS HLQ.DTLV5xx.RUNLIB on the mainframe in order to pre-
allocate the Detail library. Ensure sufficient space for the required jobs/tasks by setting
the Cylinders to 150.
Step 3: Run the MVS_Install file in the c:\Detail folder. This displays the MVS Install
Assistant (as shown below). Configure the IP Address, Logon ID, Password, HLQ, and
Default volume setting on the display screen. Also, enter the license key.

Click the Custom buttons to configure the desired data sources.
Be sure that the HLQ on this screen matches the HLQ of the allocated RUNLIB (from
step 2).
Save these settings and click Process. This creates the JCL libraries and opens the
following screen to FTP these libraries to MVS. Click XMIT to complete the FTP process.

Step 4: Edit JOBCARD in RUNLIB and configure as per the environment (e.g., execution
class, message class, etc.)
Step 5: Edit the SETUP member in RUNLIB. Copy in the JOBCARD and SUBMIT. This
process can submit from 5 to 24 jobs. All jobs should end with return code 0 (success).
Step 6: If implementing change capture, APF authorize the .LOAD and the .LOADLIB
libraries. This is required for external security and change capture only.
Step 7: If implementing change capture, copy the Agent from the PowerExchange
PROCLIB to the system site PROCLIB. In addition, when the Agent has been started,
run job SETUP2 (for change capture only).
Start The PowerExchange Jobs/Tasks on the Mainframe
The installed PowerExchange Listener can be run as a normal batch job or as a started
task. Informatica recommends that it initially be submitted as a batch job:
RUNLIB(STARTLST)
It should return: DTL-00607 Listener VRM 5.x.x Build V5xx_P0x started.
If implementing change capture, start the PowerExchange Agent (as a started task):
/S DTLA
It should return: DTLEDMI1722561: EDM Agent DTLA has completed initialization.

Install The PowerExchange Client (Navigator) on a Workstation
Step 1: Run file \DETAIL_V5xx\software\Windows\detail_pc_v5xx.exe on the DETAIL
installation CD and follow the prompts.
Step 2: Enter the license key.
Step 3: Follow the wizard to complete the install and reboot the machine.
Step 4: Add a Node entry to the configuration file \Program
Files\Striva\DETAIL\dbmover.cfg to point to the Listener on the mainframe.
node = (mainframe location name, TCPIP, mainframe IP address, 2480)
Test Connectivity to the Mainframe from the Workstation
Ensure communication to the PowerExchange Listener on the mainframe by entering
the following in DOS on the workstation:
DTLREXE PROG=PING LOC=mainframe location
It should return: DTL-00755 DTLREXE Command OK!
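If the ping fails, it can help to rule out basic network problems before looking at the
PowerExchange configuration. The hedged Python sketch below simply checks that a TCP
connection can be opened to the Listener port; the host name is a placeholder and 2480 is
the port used in the node entry above, so adjust both for your environment. It is a
supplementary network check, not a replacement for the DTLREXE ping.

    import socket

    # Placeholder values: use the mainframe address and Listener port from dbmover.cfg.
    HOST = "mvs1.example.com"
    PORT = 2480

    try:
        with socket.create_connection((HOST, PORT), timeout=10):
            print(f"TCP connection to {HOST}:{PORT} succeeded")
    except OSError as error:
        print(f"TCP connection to {HOST}:{PORT} failed: {error}")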
Install PowerExchange on the UNIX Server
Step 1: Create a user for the PowerExchange installation on the UNIX box.
Step 2: Create a UNIX directory /opt/inform/dtlv5xxp0x.
Step 3: FTP the file \DETAIL_V5xx\software\Unix\dtlxxx_v5xx.tar on the DETAIL
installation CD to the DETAIL installation directory on UNIX.
Step 4: Use the UNIX tar command to extract the files. The command is tar xvf
dtlxxx_v5xx.tar.
Step 5: Update the logon profile with the correct path, library path, and DETAIL_HOME
environment variables.
Step 6: Update the license key file on the server.
Step 7: Update the configuration file on the server (dbmover.cfg) by adding a Node
entry to point to the Listener on the mainframe.
Step 8: If using an ETL tool in conjunction with PowerExchange, via ODBC, update the
odbc.ini file on the server by adding data source entries that point to PowerExchange-
accessed data:
[striva_mvs_db2]
DRIVER=<DETAIL install dir>/libdtlodbc.so
DESCRIPTION=MVS DB2
DBTYPE=db2
LOCATION=mvs1
DBQUAL1=DB2T
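As a quick end-to-end check of the new data source, a small hedged Python sketch using
the pyodbc module is shown below. The DSN name matches the [striva_mvs_db2] entry
above; the user ID, password, and test query are placeholders, and pyodbc itself is an
assumption here rather than part of the PowerExchange install.

    import pyodbc  # assumes the pyodbc module is available on the server

    # DSN from the odbc.ini entry above; credentials are placeholders.
    connection = pyodbc.connect("DSN=striva_mvs_db2;UID=myuser;PWD=mypassword")
    cursor = connection.cursor()

    # A minimal DB2 test query; SYSIBM.SYSDUMMY1 is a standard one-row DB2 table.
    cursor.execute("SELECT CURRENT DATE FROM SYSIBM.SYSDUMMY1")
    print(cursor.fetchone())

    cursor.close()
    connection.close()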
Test Connectivity to the Mainframe from the Server
Ensure communication to the PowerExchange Listener on the mainframe by entering
the following on the UNIX server:
DTLREXE PROG=PING LOC=mainframe location
It should return: DTL-00755 DTLREXE Command OK!


Running Sessions in Recovery Mode
Challenge
Use the Load Manager architecture for manual error recovery, by suspending and
resuming the workflows and worklets when an error is encountered.
Description
When a task in the workflow fails at any point, one option is to truncate the target and
run the workflow again from the beginning. Load Manager architecture offers an
alternative to this scenario: the workflow can be suspended while the user fixes the
error, rather than re-processing the portion of the workflow that completed without
errors. This option, "Suspend on Error", results in accurate and complete target data,
as if the session had completed successfully in a single run.
Configure Mapping for Recovery
For consistent recovery, the mapping needs to produce the same result, and in the
same order, in the recovery execution as in the failed execution. This can be achieved
by sorting the input data using either the sorted ports option in Source Qualifier (or
Application Source Qualifier) or by using a sorter transformation with distinct rows
option immediately after the source qualifier transformation. Additionally, ensure that all
the targets receive data from transformations that produce repeatable data.
Configure Session for Recovery
Enable the session for recovery by setting the enable recovery option in the Config
Object tab of Session Properties.

For consistent data recovery, the session properties for the recovery session must be
the same as the session properties of the failed session.
Configure Workflow for Recovery
The Suspend on Error option directs the PowerCenter Server to suspend the workflow
while the user fixes the error, and then resumes the workflow.

The server suspends the workflow when any of the following tasks fail:
Session
Command
Worklet
Email
Timer
If any of the above tasks fail during the execution of a workflow, execution suspends at
the point of failure. The PowerCenter Server does not evaluate the outgoing links from
the task. If no other task is running in the workflow, the Workflow Monitor displays a
status of Suspended for the workflow. However, if other tasks are being executed in
the workflow when a task fails, the workflow is considered partially suspended or
partially running and the Workflow Monitor displays the status as Suspending.
When a user discovers that a workflow is either partially or completely suspended, he
or she can fix the cause of the error(s). The workflow then resumes execution from
the point of suspension, with the PowerCenter Server running the resumed tasks as if
they had never run.
The following table lists the possible combinations for suspend and resume.
SUSPEND/RESUME Scenarios:

Initial command: Start workflow
  Resume workflow: Runs the whole workflow
  Resume worklet: Runs the whole workflow

Initial command: Start workflow from task
  Resume workflow: Runs the whole workflow from the specified task
  Resume worklet: Runs the whole workflow from the specified worklet

Initial command: Start task
  Resume workflow: Runs only the suspended task (workflow task)
  Resume worklet: Runs only the suspended task (worklet task)
Truncate Target Table
If the truncate table option is enabled in a recovery-enabled session, the target table
will not be truncated during the recovery process.
Session Logs
In a suspended workflow scenario, the PowerCenter Server uses the existing session
log when it resumes the workflow from the point of suspension. However, the earlier
runs that caused the suspension are recorded in the historical run information in the
repository.
Suspension Email
The workflow can be configured to send an email when the PowerCenter Server
suspends the workflow. When a task fails, the server suspends the workflow and sends
the suspension email. The user can then fix the error and resume the workflow. If
another task fails while the PowerCenter Server is suspending the workflow, the server
does not send another suspension email. The server only sends out another suspension
email if another task fails after the workflow resumes. Check the "Browse Emails"
button on the General tab of the Workflow Designer Edit sheet to configure the
suspension email.

Suspending Worklets
When the "Suspend On Error" option is enabled for the parent workflow, the
PowerCenter Server also suspends the worklet if a task within the worklet fails. When a
task in the worklet fails, the server stops executing the failed task and other tasks in its
path. If no other task is running in the worklet, the status of the worklet is
"Suspended". If other tasks are still running in the worklet, the status of the worklet is
"Suspending". The parent workflow is also suspended when the worklet is "Suspended"
or "Suspending".

Assume that the suspension always occurs in the worklet and you issue a Resume command after the error is fixed. The following table describes various suspend and resume scenarios with reference to the diagram above. Note that the worklet contains a Start task and Session3:
Initial command: Start workflow
  Result of the initial command: Session1 and Session2 run. Worklet runs and suspends. Session4 does not run.
  Resume workflow: Runs Worklet and Session4
  Resume worklet: Runs Worklet and Session4
Initial command: Start workflow from Worklet
  Result of the initial command: Worklet runs and suspends. Session1, Session2, and Session4 do not run.
  Resume workflow: Runs Worklet and Session4
  Resume worklet: Runs Worklet and Session4
Initial command: Start task
  Result of the initial command: Worklet runs and suspends. No other tasks run.
  Resume workflow: Runs only the suspended task (Worklet)
  Resume worklet: Runs only the suspended task (Worklet)
Starting Recovery
The recovery process can be started using the Workflow Manager or Workflow Monitor client tools. Alternatively, it can be started from the command line with pmcmd, either interactively or from a script.
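The fragment below is only a sketch of such a scripted call; the host, port, folder, workflow, and account names are placeholders, and the exact command that triggers a recovery run (some releases expose a dedicated recoverworkflow command) differs between PowerCenter versions, so confirm it with pmcmd's built-in help before relying on it.

    # Hypothetical example; verify the recovery command and flag names with "pmcmd help" for your release.
    pmcmd recoverworkflow -s pc_server_host:4001 -u etl_operator -p etl_password -f DW_LOADS wf_daily_load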
Recovery Tables and Recovery Process
When sessions are enabled for recovery, the PowerCenter Server creates two tables (PM_RECOVERY and PM_TGT_RUN_ID) in the target database. During regular session runs, the server updates these tables with the target load status. The session fails if the PowerCenter Server cannot create these tables due to insufficient privileges; once created, the tables are reused.
When a session is run in recovery mode, the PowerCenter Server uses the information in these tables to determine the point of failure and continues to load target data from that point. If the recovery tables (PM_RECOVERY and PM_TGT_RUN_ID) are not present in the target database, the recovery session fails.
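Because the PowerCenter Server creates these tables using the target connection's database account, that account needs table-creation rights in the target schema. A minimal sketch for an Oracle target follows; the user name is a placeholder, and your DBA may prefer a narrower grant.

    -- Hypothetical example (Oracle target): run as a DBA so the PowerCenter target user
    -- can create PM_RECOVERY and PM_TGT_RUN_ID on the first recovery-enabled run.
    GRANT CREATE TABLE TO dw_target_user;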
Unrecoverable Sessions
The following session configurations are not supported by PowerCenter for session
recovery:
Sessions using partitioning other than pass-through partitioning.
Sessions using database partitioning.
Recovery using the debugger.
Test load using a recovery-enabled session.

Inconsistent Data During Recovery Process
For recovery to be effective, the recovery session must produce the same set of rows in the same order. Any change to the mapping, the session, or the server after the initial failure that affects the ability to produce repeatable data will result in inconsistent data during the recovery process.
The following cases may produce inconsistent data during a recovery session:
Session performs incremental aggregation and the server stops unexpectedly.
Mapping uses a Sequence Generator transformation.
Mapping uses a Normalizer transformation.
Source and/or target changes after the initial session failure.
Data movement mode changes after the initial session failure.
Code page (server, source, or target) changes after the initial session failure.
Mapping is changed in a way that causes the server to distribute, filter, or aggregate rows differently.
Session uses a configuration that PowerCenter does not support for session recovery.
Mapping uses a lookup table and the data in the lookup table changes between session runs.
Session sort order changes when the server is running in Unicode mode.

Complex Mappings and Recovery
For complex mappings that load more than one related target (e.g., targets with a primary key/foreign key relationship), a session failure and subsequent recovery may lead to data integrity issues. In such cases, check and fix the integrity of the target tables before starting the recovery process.











Configuring Security
Challenge
Configuring a PowerCenter security scheme to prevent unauthorized access to
mappings, folders, sessions, workflows, repositories, and data in order to ensure
system integrity and data confidentiality.
Description
Configuring security is one of the most important components of building a data
warehouse. Determining an optimal security configuration for a PowerCenter
environment requires a thorough understanding of business requirements, data
content, and end-user access requirements. Knowledge of PowerCenter's security
functionality and facilities is also a prerequisite to security design.
Implement security with the goals of easy maintenance and scalability. When
establishing repository security, keep it simple. Although PowerCenter includes the
utilities for a complex web of security, the simpler the configuration, the easier it is
to maintain. Securing the PowerCenter environment involves the following basic
principles:
Create users and groups
Define access requirements
Grant privileges and permissions
Before implementing security measures, ask and answer the following questions:
Who will administer the repository?
How many projects need to be administered? Will the administrator be able to
manage security for all PowerCenter projects or just a select few?
How many environments will be supported in the repository?
Who needs access to the repository? What do they need the ability to do?
How will the metadata be organized in the repository? How many folders will be
required?
Where can we limit repository privileges by granting folder permissions instead?
Who will need Administrator or Super User-type access?
After you evaluate the needs of the repository users, you can create appropriate user
groups and assign repository privileges and folder permissions. In most implementations, the administrator takes care of maintaining the repository. Limit the number of
administrator accounts for PowerCenter. While this concept is important in a
development/unit test environment, it is critical for protecting the production
environment.
Repository Security Overview
A security system needs to properly control access to all sources, targets, mappings,
reusable transformations, tasks and workflows in both the test and production
repositories. A successful security model needs to support all groups in the project
lifecycle and also consider the repository structure.
Informatica offers multiple layers of security, which enables you to customize the
security within your data warehouse environment. Metadata level security controls
access to PowerCenter repositories, which contain objects grouped by folders. Access to
metadata is determined by the privileges granted to the user or to a group of users and
the access permissions granted on each folder. Some privileges do not apply by folder,
as they are granted by privilege alone (i.e., repository-level tasks).
Just beyond PowerCenter authentication is the connection to the repository database.
All client connectivity to the repository is handled by the PowerCenter Repository Server
and Repository Agent over a TCP/IP connection. The particular database account and
password is specified at installation and during the configuration of the Repository
Server.
Other forms of security available in PowerCenter include permissions for connections.
Connections include database, FTP, and external loader connections. These
permissions are useful when you want to limit access to schemas in a relational
database and can be set up in the Workflow Manager when source and target
connections are defined.
Occasionally, you may want to restrict changes to source and target definitions in the
repository. A common way to approach this security issue is to use shared folders,
which are owned by an Administrator or Super User. Granting read access to
developers on these folders allows them to create read-only copies in their work
folders.
Informatica Security Architecture
The following diagram, Informatica PowerCenter Security, depicts PowerCenter
security, including access to the repository, Repository Server, PowerCenter Server and
the command-line utilities pmrep and pmcmd.
As shown in the diagram below, the Repository Server is the central component when using default security. It sits between the PowerCenter repository and all client applications, including GUI tools, command line tools, and the PowerCenter Server. Each application must be authenticated against metadata stored in several tables within the repository. The Repository Server requires one database account, under which all security data is stored as part of the repository metadata. This provides a second layer of security: only the Repository Server uses this account, and it authenticates all client applications against this metadata.




Repository server security
Connection to the PowerCenter repository database is one level of security. All client
connectivity to the repository is handled by the Repository Server and Repository Agent
over a TCP/IP connection. The Repository Server process is installed in a Windows or
UNIX environment, typically on the same physical server as the PowerCenter Server. It
can be installed under the same or different operating system account as the
PowerCenter Server.
When the Repository Server is installed, the database connection information is entered
for the metadata repository. At this time you need to know the database user id and
password to access the metadata repository. The database user id must be able to read
and write to all tables in the database. As a developer creates, modifies, and executes mappings and sessions, the metadata in the repository is continuously updated. Actual database security should be controlled by the DBA responsible for
that database, in conjunction with the PowerCenter Repository Administrator. After the
Repository Server is installed and started, all subsequent client connectivity is automatic. The users are simply prompted for the name of the Repository Server and
host name. The database id and password are transparent at this point.
PowerCenter server security
Like the Repository Server, the PowerCenter Server communicates with the metadata
repository when it executes workflows or when users are using Workflow Monitor.
During configuration of the PowerCenter Server, the repository database is identified
with the appropriate user id and password to use. This information is specified in the
PowerCenter configuration file (pmserver.cfg). Connectivity to the repository is made
using native drivers supplied by Informatica.
Certain permissions are also required to use the command line utilities pmrep and
pmcmd.
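For example, a pmrep session authenticates with a repository user account and operates under that account's privileges. The sketch below uses placeholder repository, host, port, and account names; flag names can differ between PowerCenter releases, so confirm them with pmrep's built-in help.

    # Hypothetical example: connect to the repository before issuing other pmrep commands.
    pmrep connect -r DW_REPOSITORY -n etl_admin -x etl_password -h repserver_host -o 5001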
Connection Object Permissions
Within Workflow Manager, you can grant read, write, and execute permissions to
groups and/or users for all types of connection objects. This controls who can create,
view, change, and execute workflow tasks that use those specific connections,
providing another level of security for these global repository objects.
Users with the Use Workflow Manager privilege can create and modify connection objects.
Connection objects allow the PowerCenter server to read and write to source and target
databases. Any database the server will access will require a connection definition. As
shown below, connection information is stored in the repository. Users executing
workflows will require execution permission on all connections used by the workflow.
The PowerCenter server looks up the connection information in the repository, and
verifies permission for the required action. If permissions are properly granted, the
server will read and write to the defined databases as defined by the workflow.


Users
Users are the fundamental objects of security in a PowerCenter environment. Each
individual logging into the PowerCenter repository should have a unique user account.
Informatica does not recommend creating shared accounts; unique accounts should be
created for each user. Each repository user needs a user name and password to access
the repository, which should be provided by the PowerCenter Repository Administrator.
Users are created and managed through Repository Manager. Users should change their
passwords from the default immediately after receiving the initial user id from the
Administrator. Passwords can be reset by the user if they are granted the privilege Use
Repository Manager.
When you create the repository, the repository automatically creates two default users:
Administrator. The default password for Administrator is Administrator.
Database user. The username and password used when you created the
repository.
These default users are in the Administrators user group, with full privileges within the
repository. They cannot be deleted from the repository, nor have their group affiliation
changed.
To administer repository users, you must have one of the following privileges:
Administer Repository
Super User
LDAP (Lightweight Directory Access Protocol)
In addition to default repository user authentication, LDAP can be used to authenticate
users. Using LDAP authentication, the repository maintains an association between the
repository user and the external login name. When a user logs into the repository, the
security module authenticates the user name and password against the external
directory. The repository maintains a status for each user. Users can be enabled or
disabled by modifying this status.
Prior to implementing LDAP, the administrator must know:
Repository server username and password
An administrator or superuser user name and password for the repository
An external login name and password
Configuring LDAP
Edit ldap_authen.xml and modify the following attributes:
o NAME: the .dll that implements the authentication
o OSTYPE: the host operating system
Register ldap_authen.xml in the Repository Server Administration Console.
In the Repository Server Administration Console, configure the authentication module.


User Groups
When you create a repository, the Repository Manager creates two repository user
groups. These two groups exist so you can immediately create users and begin
developing repository objects.
The default repository user groups are:
Administrators
Public
The Administrators group has super user access. The Public group has a subset of
default repository privileges. These groups cannot be deleted from the repository nor
have their configured privileges changed.
You should create custom user groups to manage users and repository privileges
effectively. The number and types of groups that you create should reflect the needs of
your development teams, administrators, and operations group. Informatica
recommends minimizing the number of custom user groups that you create in order to
facilitate the maintenance process.
A starting point is to create a group for each combination of privileges needed
to support the development cycle and production process. This is the recommended
method for assigning privileges. After creating a user group, you assign a set of
privileges for that group. Each repository user must be assigned to at least one user
group. When you assign a user to a group, the user:
Receives all group privileges.
Inherits any changes to group privileges.
Loses and gains privileges if you change the user group membership.
You can also assign users to multiple groups, which grants the user the privileges of
each group. Use the Repository Manager to create and edit repository user groups.
Folder Permissions
When you create or edit a folder, you define permissions for the folder. The
permissions can be set at three different levels:
1. Owner
2. Owner's group
3. Repository (the remainder of the users within the repository)
o First, choose an owner (i.e., user) and group for the folder. If the owner
belongs to more than one group, you must select one of the groups
listed.
o Once the folder is defined and the owner is selected, determine what level
of permissions you would like to grant to the users within the group.
o Then determine the permission level for the remainder of the repository
users.

The permissions that can be set include: read, write, and execute. Any combination of
these can be granted to the owner, group or repository.
Be sure to consider folder permissions very carefully. They offer the easiest way to restrict user and group access to folders. The following table gives some examples of folders, their type, and
recommended ownership.
DEVELOPER_1
  Folder type: Initial development, temporary work area, unit test
  Proposed owner: Individual developer
DEVELOPMENT
  Folder type: Integrated development
  Proposed owner: Development lead, Administrator, or Super User
UAT
  Folder type: Integrated User Acceptance Test
  Proposed owner: UAT lead, Administrator, or Super User
PRODUCTION
  Folder type: Production
  Proposed owner: Administrator or Super User
PRODUCTION SUPPORT
  Folder type: Production fixes and upgrades
  Proposed owner: Production support lead, Administrator, or Super User
Repository Privileges
Repository privileges work in conjunction with folder permissions to give a user or
group authority to perform tasks. Repository privileges are the most granular way of
controlling a user's activity. Consider the privileges that each user group requires, as
well as folder permissions, when determining the breakdown of users into groups.
Informatica recommends creating one group for each distinct combination of folder
permissions and privileges.
When you assign a user to a user group, the user receives all privileges granted to the
group. You can also assign privileges to users individually. When you grant a privilege
to an individual user, the user retains that privilege even if his or her user group
affiliation changes. For example, you have a user in a Developer group who has limited
group privileges, and you want this user to act as a backup administrator when you are
not available. For the user to perform every task in every folder in the repository, and
to administer the PowerCenter Server, the user must have the Super User privilege.
For tighter security, grant the Super User privilege to the individual user, not the entire
Developer group. This limits the number of users with the Super User privilege, and
ensures that the user retains the privilege even if you remove the user from the
Developer group.
The Repository Manager grants a default set of privileges to each new user and group
for working within the repository. You can add or remove privileges from any user or
group except:
Administrators and Public (the default read-only repository groups)
Administrator and the database user who created the repository (the users
automatically created in the Administrators group)
The Repository Manager automatically grants each new user and new group the default
privileges. These privileges allow you to perform basic tasks in Designer, Repository Manager, Workflow Manager, and Workflow Monitor. The following table lists the default
repository privileges:

Default Repository Privileges

Use Designer
  Folder permission: N/A; Connection object permission: N/A
    Connect to the repository using the Designer.
    Configure connection information.
  Folder permission: Read; Connection object permission: N/A
    View objects in the folder.
    Change folder versions.
    Create shortcuts to objects in the folder.
    Copy objects from the folder.
    Export objects.
  Folder permission: Read/Write; Connection object permission: N/A
    Create or edit metadata.
    Create shortcuts from shared folders.
    Copy objects into the folder.
    Import objects.

Browse Repository
  Folder permission: N/A; Connection object permission: N/A
    Connect to the repository using the Repository Manager.
    Add and remove reports.
    Import, export, or remove the registry.
    Search by keywords.
    Change your user password.
  Folder permission: Read; Connection object permission: N/A
    View dependencies.
    Unlock objects, versions, and folders locked by your username.
    Edit folder properties for folders you own.
    Copy a version. (You must also have the Administer Repository or Super User privilege in the target repository and write permission on the target folder.)
    Copy a folder. (You must also have the Administer Repository or Super User privilege in the target repository.)

Use Workflow Manager
  Folder permission: N/A; Connection object permission: N/A
    Connect to the repository using the Workflow Manager.
    Create database, FTP, and external loader connections in the Workflow Manager.
    Run the Workflow Monitor.
  Folder permission: N/A; Connection object permission: Read/Write
    Edit database, FTP, and external loader connections in the Workflow Manager.
  Folder permission: Read; Connection object permission: N/A
    Export sessions.
    View workflows.
    View sessions.
    View tasks.
    View session details and session performance details.
  Folder permission: Read/Write; Connection object permission: N/A
    Create and edit workflows and tasks.
    Import sessions.
    Validate workflows and tasks.
  Folder permission: Read/Write; Connection object permission: Read
    Create and edit sessions.
  Folder permission: Read/Execute; Connection object permission: N/A
    View the session log.
  Folder permission: Read/Execute; Connection object permission: Execute
    Schedule or unschedule workflows.
    Start workflows immediately.
  Folder permission: Execute; Connection object permission: N/A
    Restart workflow.
    Stop workflow.
    Abort workflow.
    Resume workflow.

Use Repository Manager
  Permission: N/A
    Remove label references.
  Permission: Write on the deployment group
    Delete from deployment group.
  Permission: Write on the folder
    Change an object's version comments if not the owner.
    Change the status of an object.
    Check in.
    Check out / undo check-out.
    Delete objects from the folder.
    Mass validation (needs write permission if options are selected).
    Recover after delete.
  Permission: Read on the folder
    Export objects.
  Permission: Read/Write on the folder and the deployment group
    Add to deployment group.
  Permission: Read/Write on the original folders and the target folder
    Copy objects.
    Import objects.
  Permission: Read/Write/Execute on the folder and the label
    Apply label.
Extended Privileges
In addition to the default privileges listed above, Repository Manager provides extended
privileges that you can assign to users and groups. These privileges are granted to the
Administrators group by default. The following table lists the extended repository
privileges:
Extended Repository Privileges

Workflow Operator
  Folder permission: N/A; Connection object permission: N/A
    Connect to the Informatica Server.
  Folder permission: Execute; Connection object permission: N/A
    Restart workflow.
    Stop workflow.
    Abort workflow.
    Resume workflow.
  Folder permission: Execute; Connection object permission: Execute
    Use pmcmd to start workflows in folders for which you have execute permission.
  Folder permission: Read; Connection object permission: Execute
    Start workflows immediately.
  Folder permission: Read; Connection object permission: N/A
    Schedule and unschedule workflows.
    View the session log.
    View session details and performance details.

Administer Repository
  Folder permission: N/A; Connection object permission: N/A
    Connect to the repository using the Repository Manager.
    Connect to the Repository Server.
    Create, upgrade, backup, delete, and restore the repository.
    Start, stop, enable, disable, and check the status of the repository.
    Manage passwords, users, groups, and privileges.
    Manage connection object permissions.
  Folder permission: Read; Connection object permission: Read
    Copy a folder within the same repository.
    Copy a folder into a different repository when you have the Administer Repository privilege on the destination repository.
  Folder permission: Read; Connection object permission: N/A
    Edit folder properties.
  Folder permission: Read/Write; Connection object permission: N/A
    Copy a folder into the repository.

Administer Server
  Folder permission: N/A; Connection object permission: N/A
    Register Informatica Servers with the repository.
    Edit server variable directories.
    Start the Informatica Server. (The user entered in the Informatica Server setup must have this repository privilege.)
    Stop the Informatica Server through the Workflow Manager.
    Stop the Informatica Server using the pmcmd program.

Super User
  Folder permission: N/A; Connection object permission: N/A
    Perform all tasks, across all folders in the repository.
    Manage connection object permissions.
Extended privileges allow you to perform more tasks and expand the access you have
to repository objects. Informatica recommends that you reserve extended privileges for
individual users and grant default privileges to groups.
Audit trails
Audit trails can be accessed through the Repository Server Administration Console. The
repository agent logs security changes in the repository server installation directory.

The audit trail can be turned on or off for each individual repository by selecting the SecurityAuditTrail checkbox on the Configuration tab of the repository's Properties window.

The audit log records the following events:
Changing the owner, owner's group, or permissions for a folder.
Changing the password of another user.
Adding or removing a user.
Adding or removing a group.
Adding or removing users from a group.
Changing global object permissions.
Adding or removing user and group privileges.

Sample Security Implementation
The following steps provide an example of how to establish users, groups, permissions
and privileges in your environment. Again, the requirements of your projects and
production systems should dictate how security is established.

1. Identify users and the environments they will support (development, UAT, QA,
production, production support, etc).
2. Identify the PowerCenter repositories in your environment (this may be similar
to the basic groups listed in Step 1, e.g., development, UAT, QA, production,
etc).
3. Identify what users need to exist in each repository.
4. Define the groups that will exist in each PowerCenter Repository.
5. Assign users to groups.
6. Define privileges for each group.
The following table provides an example of groups and privileges that may exist in the
PowerCenter repository. This example assumes one PowerCenter project with three
environments co-existing in one PowerCenter repository.

ADMINISTRATORS
  Folder: All
  Folder permissions: All
  Privileges: Super User (all privileges)
DEVELOPERS
  Folder: Individual development folder; integrated development folder
  Folder permissions: Read, Write, Execute
  Privileges: Use Designer, Browse Repository, Use Workflow Manager
DEVELOPERS
  Folder: UAT
  Folder permissions: Read
  Privileges: Use Designer, Browse Repository, Use Workflow Manager
UAT
  Folder: UAT working folder
  Folder permissions: Read, Write, Execute
  Privileges: Use Designer, Browse Repository, Use Workflow Manager
UAT
  Folder: Production
  Folder permissions: Read
  Privileges: Use Designer, Browse Repository, Use Workflow Manager
OPERATIONS
  Folder: Production
  Folder permissions: Read, Execute
  Privileges: Browse Repository, Workflow Operator
PRODUCTION SUPPORT
  Folder: Production maintenance folders
  Folder permissions: Read, Write, Execute
  Privileges: Use Designer, Browse Repository, Use Workflow Manager
PRODUCTION SUPPORT
  Folder: Production
  Folder permissions: Read
  Privileges: Browse Repository

Informatica PowerCenter Security Administration
As mentioned earlier, one individual should be identified as the Informatica Administrator. This individual should be responsible for a number of tasks in the Informatica environment, including security. To summarize, the security-related tasks an administrator should be responsible for include:
Creating user accounts.
Defining and creating groups.
Defining and granting folder permissions.
Defining and granting repository privileges.
Enforcing changes in passwords.
Controlling requests for changes in privileges.
Creating and maintaining database, FTP, and external loader connections in conjunction with the database administrator.
Working with the operations group to ensure tight security in the production environment.
Remember, you must have one of the following privileges to administer repository
users:
Administer Repository
Super User

Summary of Recommendations
When implementing your security model, keep the following recommendations in mind:
Create groups with limited privileges.
Do not use shared accounts.
Limit user and group access to multiple repositories.
Customize user privileges.
Limit the Super User privilege.
Limit the Administer Repository privilege.
Restrict the Workflow Operator privilege.
Follow a naming convention for user accounts and group names.
For more secure environments, turn Audit Trail logging on.











Custom XConnect Implementation
Challenge
Each XConnect extracts metadata from a particular repository type and loads it into the
SuperGlue warehouse. The SuperGlue Configuration Console is used to run each
XConnect. A custom XConnect loads metadata for tools or processes for which Informatica does not provide an out-of-the-box metadata solution.
Description
To integrate custom metadata, complete the steps for the following tasks:
Design the metamodel
Implement the metamodel design
Set up and run the custom XConnect
Configure the reports and schema

Prerequisites for Integrating Custom Metadata
To integrate custom metadata, install SuperGlue and the other required applications.
The custom metadata integration process assumes knowledge of the following topics:
Common Warehouse Metamodel (CWM) and Informatica-defined metamodels. The
CWM metamodel includes industry-standard packages, classes, and class
associations. The Informatica-defined metamodel supplements the CWM
metamodel by providing repository-specific packages, classes, and class
associations. For more information about CWM, see http://www.omg.org/cwm/.
For more information about the Informatica-defined metamodel components,
run and review the metamodel reports.
PowerCenter functionality. The metadata integration process requires configuring and running PowerCenter workflows that extract custom metadata from source repositories and load it into the SuperGlue warehouse. PowerCenter can be
used to build a custom XConnect.
PowerAnalyzer functionality. SuperGlue embeds PowerAnalyzer functionality to
create, run, and maintain a metadata reporting environment. Knowledge of
creating, modifying, and deleting reports, dashboards, and analytic workflows in
PowerAnalyzer is required. Knowledge of creating, modifying, and deleting table definitions, metrics, and attributes is required to update the schema with
new or changed objects.

Design the Metamodel
The objective of this phase is to design the metamodel. A UML modeling tool can be
used to help define the classes, class properties, and associations.
This task consists of the following steps:
1. Identify Custom Classes. Identify all custom classes. To identify classes,
determine the various types of metadata in the source repository that need to be loaded into the SuperGlue warehouse. Each type of metadata corresponds to
one class.
2. Identify Custom Class Properties. For each class identified in step 1, identify all
class properties that need to be tracked in the SuperGlue warehouse.
3. Map Custom Classes to CWM Classes. SuperGlue prepackages all CWM classes,
class properties, and class associations. To quickly develop a custom metamodel
and reduce redundancy, reuse the predefined class properties and associations
instead of recreating them. To determine which custom classes can inherit
properties from CWM classes, map custom classes to the packaged CWM
classes. For all properties that cannot be inherited, define them in SuperGlue.
4. Determine the Metadata Tree Structure. Configure the way the metadata tree
displays objects. Configure the metadata tree structure for a class when defining
the class in the next task "Implement the Metamodel Design". Configure classes
of objects to display in the metadata tree along with folders and the objects they
contain.
5. Identify Custom Class Associations. The metadata browser uses class
associations to display metadata. For each identified class association,
determine if a predefined association from a CWM base class can be reused or if
an association needs to be defined manually in SuperGlue.
6. Identify Custom Packages. A package contains related classes and class
associations. Import and export packages of classes and class associations from
SuperGlue. Assign packages to repository types to define the structure of the
contained metadata. In this step, identify packages to group the custom classes
and associations you identified in previous steps.
Implement the Metamodel Design
Using the metamodel design specifications from the previous task, implement the
metamodel in SuperGlue. To complete the steps in this task, you will need one of the
following roles:
Advanced Provider
Schema Designer
System Administrator
This task includes the following steps.

1. Create Custom Metamodel Originator in SuperGlue. The SuperGlue warehouse
may contain many metamodels that store metadata from a variety of source
systems. When creating a new metamodel, enter the originator of each
metamodel. An originator is the organization that creates and owns the
metamodel. When defining a new custom originator in SuperGlue, select
Customer as the originator type.
2. Create Custom Packages in SuperGlue. Define the packages to which custom
classes and associations are assigned. Packages contain classes and their class
associations. Packages have a hierarchical structure, where one package can be
the parent of another package. Parent packages are generally used to group
child packages together.
3. Create Custom Classes in SuperGlue. In this step, create custom classes
identified in the metamodel design task.
4. Create Custom Class Associations in SuperGlue. In this step, implement the
custom class associations identified in the metamodel design phase. In the
previous step, CWM classes are added as base classes. Any of the class
associations from the CWM base classes can be reused. Define those custom
class associations that cannot be reused.
5. Create Custom Repository Type in SuperGlue. Each type of repository contains
unique metadata. For example, a PowerCenter data integration repository type
contains workflows and mappings, but a PowerAnalyzer business intelligence
repository type does not.
6. Associate packages to Custom Repository Type. To maintain the uniqueness of
each repository type, define repository types in SuperGlue, and for each
repository type, assign packages of classes and class associations to it.
Set Up and Run the XConnect
The objective of this task is to set up and run the custom XConnect. Transform source
metadata into the required format specified in the IME interface files. The custom
XConnect then extracts the metadata from the IME interface file and loads it into the
SuperGlue warehouse.
This task includes the following steps:
1. Determine which SuperGlue warehouse tables to load. Based on the type of
metadata that needs to be viewed in the metadata directory and reports,
determine which SuperGlue warehouse tables are required for the metadata
load. To stop the metadata load into particular SuperGlue warehouse tables,
disable the worklets that load those tables.
2. Reformat the source metadata. In this step, reformat the source metadata so
that it conforms to the format specified in each required IME interface file.
Present the reformatted metadata in a valid source type format. To extract the
reformatted metadata, the integration workflows require that the reformatted
metadata be in one or more of the following source type formats: database
table, database view, or flat file. Metadata can be loaded into a SuperGlue
warehouse table using more than one of the accepted source type formats. For example, metadata can be loaded into the IMW_ELEMENT table from both a database view and a flat file.
3. Register the Source Repository Instance in SuperGlue. Before extracting
metadata, you must first register the source repository in SuperGlue. Register
the repository under the custom repository type created in the previous task. All packages, classes, and class associations defined for the custom repository type
apply to all repository instances registered to the repository type. When defining
the repository, provide descriptive information about the repository instance.
When registering the repository, define a repository ID that must uniquely
identify the repository. If the source repository stores a repository ID, use that
value for the repository ID. Once the repository is registered in SuperGlue,
SuperGlue adds an XConnect in the Configuration Console for the repository. To
register a repository in SuperGlue, you will need one of the following roles:
Advanced Provider, Schema Designer, or System Administrator.
4. Configure the Custom Parameter File. SuperGlue prepackages the parameter
files for each XConnect. Update the parameter file by specifying the following
information: source type (e.g., database table, database view, or flat file), name
of the database views or tables used to load the SuperGlue warehouse, list of all
flat files used to load a particular SuperGlue warehouse table, frequency at
which the SuperGlue warehouse is updated, worklets that need to be enabled
and disabled, and the method used to determine field datatypes.
5. Configure the Custom XConnect. Once the custom repository type is defined in
SuperGlue, the SuperGlue Server registers the corresponding XConnect in the
Configuration Console. Specify the following information in the Configuration
Console to configure the XConnect: repository type to which the custom
repository belongs, workflows required to load the metadata, name of the
XConnect, and parameter file used by the workflows to load the metadata.
6. Run the Custom XConnect. Using the Configuration Console, run the XConnect
and ensure that the metadata loads correctly.
7. Reset the $$SRC_INCR_DATE Parameter. After completing the first metadata
load, reset the $$SRC_INCR_DATE parameter to extract metadata in shorter
intervals, such as every 5 days. The value depends on how often the SuperGlue
warehouse needs to be updated. If the source does not provide the date when
the records were last updated, records are extracted regardless of the
$$SRC_INCR_DATE parameter setting.
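For illustration only, the relevant parameter file entry might resemble the sketch below; the section header, workflow name, and date value are placeholders, and both the header format and the expected value format should be copied from the entries already present in the prepackaged SuperGlue parameter file.

    [SuperGlue_Custom.WF:wf_custom_xconnect]
    $$SRC_INCR_DATE=01/01/2005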
Configure the Reports and Schema
The objective of this task is to set up the reporting environment, which is used to run reports on the metadata stored in the SuperGlue warehouse. How you set up the
reporting environment depends on the reporting requirements. The following options
are available for creating reports:
Use the existing schema and reports. SuperGlue contains packaged reports that
can be used to analyze business intelligence metadata, data integration
metadata, data modeling tool metadata, and database catalog metadata.
SuperGlue also provides impact analysis and lineage reports that provide
information on any type of metadata.
Create new reports using the existing schema. Build new reports using the existing
SuperGlue metrics and attributes.
Create new SuperGlue warehouse tables and views to support the schema and
reports. If the packaged SuperGlue schema does not meet the reporting
requirements, create new SuperGlue warehouse tables and views. Prefix the
name of custom-built tables with Z_IMW_. Prefix custom-built views with
Z_IMA_. If you build new SuperGlue warehouse tables or views, register the
tables in the SuperGlue schema and create new metrics and attributes in the SuperGlue schema. Note that the SuperGlue schema is built on the SuperGlue
views.
After the environment setup is complete, test all schema objects, such as dashboards,
analytic workflows, reports, metrics, attributes, and alerts.











Customizing the SuperGlue Interface
Challenge
Customizing the SuperGlue presentation layer to meet specific business needs.
Description

Configuring Metamodels
It may be necessary to configure metamodels for a repository type in order to integrate
additional metadata into a SuperGlue Warehouse and/or to adapt to changes in
metadata reporting and browsing requirements. For more information about creating a
metamodel for a new repository type, see the SuperGlue Custom Metadata Integration
Guide.
Use SuperGlue to define a metamodel, which consists of the following objects:
Originator - the party that creates and owns the metamodel.
Packages - contain related classes that model metadata for a particular application
domain or specific application. Multiple packages can be defined under the newly
defined originator. Each package stores classes and associations that represent
the metamodel.
Classes and Class Properties - define a type of object, with its properties, contained
in a repository. Multiple classes can be defined under a single package. Each
class has multiple properties associated to it. These properties can be inherited
from one or many base classes already available. Additional properties can be
defined directly under the new class.
Associations - define the relationships between classes and their objects. Associations help define relationships across individual classes. The cardinality
helps define 1-1, 1-n or n-n relationships. These relationships mirror real life
associations of logical, physical, or design level building blocks of systems and
processes.
For more information about metamodels, originators, packages, classes, and
associations, see SuperGlue Concepts in the SuperGlue Installation and Administration Guide.

After the metamodel is defined, it needs to be associated with a repository type. When
registering a repository under a repository type, all classes and associations assigned to
the repository type through packages apply to the repository.
The Metamodel Management task area on the Administration tab in SuperGlue provides
the following options for configuring metamodels:
Repository types
You can configure types of repositories for the metadata you want to store and manage
in the SuperGlue Warehouse. You must configure a repository type when you develop
an XConnect. You can modify some attributes for existing XConnects and XConnect
repository types. For more information, see Configuring Repository Types in the
SuperGlue Installation and Administration Guide.
Displaying Objects of an Association in the Metadata Tree
SuperGlue displays many objects in the metadata tree by default because of the
predefined associations among metadata objects. Associations determine how objects
display in the metadata tree.
If you want to display an object in the metadata tree that does not already display, add
an association between the objects in the IMM.properties file.
For example, Object A displays in the metadata tree and Object B does not. To display
Object B under Object A in the metadata tree, perform the following actions:
Create an association from Object B to Object A. From Objects in an association
display as parent objects; To Objects display as child objects. The To Object
displays in the metadata tree only if the From Object in the association already
displays in the metadata tree. For more information about adding associations,
refer to Adding an Association in the SuperGlue Installation and Administration Guide.
Add the association to the IMM.properties file. SuperGlue only displays objects in
the metadata tree if the corresponding association between their classes is
included in the IMM.properties file.
Note: Some associations are not explicitly defined among the classes of objects. Some
objects reuse associations based on the ancestors of the classes. The metadata tree
displays objects that have explicit or reused associations. For more information about
ancestors and reusing associations, see Reusing Class Associations of a Base Class or Ancestor in the SuperGlue Installation and Administration Guide.
To add the association to the IMM.properties file:
1. Open the IMM.properties file. The file is located in the following directory:
For WebLogic: <WebLogic_Home>\wlserver6.1
For WebSphere: <WebSphere_Home>\AppServer
2. Add the association ID under findtab.parentChildAssociations (an illustrative entry is shown below). To determine the ID of an association, click the association on the Associations page. To access the Associations page, click Administration > Metamodel Management > Associations.
3. Save and close the IMM.properties file.
4. Stop and then restart the SuperGlue Server to apply the changes.
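The illustrative entry referenced in step 2 might look like the following; the association IDs are placeholders, and the exact delimiter and syntax should mirror the entries already present in your IMM.properties file.

    # Hypothetical IMM.properties excerpt; 1201 and 1305 are example association IDs only.
    findtab.parentChildAssociations=1201,1305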

Customizing SuperGlue Metadata Browser
The Metadata Browser, on the Metadata Directory page, is used for browsing source
repository metadata stored in the SuperGlue Warehouse. The following figure shows a
sample metadata directory page on the Find Tab of SuperGlue.


The Metadata Directory page consists of the following areas:
Query task area - allows you to search for metadata objects stored in the
SuperGlue Warehouse.
Metadata Tree task area - allows you to navigate to a metadata object in a
particular repository.
Results task area - displays metadata objects based on an object search in the
Query task area or based on the object selected in the Metadata Tree task area.
Details task area - displays properties about the selected object. You can also view
associations between the object and other objects, and run related reports from
the Details task area.
For more information about the Metadata Directory page on the Find tab, refer to the Accessing Source Repository Metadata chapter in the SuperGlue User Guide.
You can perform the following customizations while browsing the source repository
metadata:
Configure the display properties
SuperGlue displays a set of default properties for all items in the Results task area. The
default properties are generic properties that apply to all metadata objects stored in the
SuperGlue Warehouse.

By default, SuperGlue displays the following properties in the Results task area for each
source repository object:
Class - Displays an icon that represents the class of the selected object. The class
name appears when you place the pointer over the icon.
Label - Label of the object.
Source Update Date - Date the object was last updated in the source repository.
Repository Name - Name of the source repository from which the object originates.
Description - Description of the object.
The default properties that appear in the Results task area can, however, be
rearranged, added, and/or removed for a SuperGlue user account. For example, you
can remove the default Class and Source Update Date properties, move the Repository
Name property to precede the Label property, and add a different property, such as the
Warehouse Insertion Date, to the list.
Additionally, you can add other properties that are specific to the class of the selected
object. With the exception of Label, all other default properties can be removed. You
can select up to ten properties to display in the Results task area. SuperGlue displays
them in the order specified while configuring.
If there are more than ten properties to display, SuperGlue displays the first ten,
displaying common properties first in the order specified and then all remaining
properties in alphabetical order based on the property display label.
Applying Favorite Properties for Multiple Classes of Objects
The modified property display settings can be applied to any class of objects displayed
in the Results task area. When selecting an object in the metadata tree, multiple
classes of objects may appear in the Results task area. The following figure shows how
to apply the modified display settings for each class of objects in the Results task area:


The same settings can be applied to the other classes of objects that currently display
in the Results task area.
If the settings are not applied to the other classes, then the settings apply to the
objects of the same class as the object selected in the metadata tree.
Configuring Object Links
Object links are created to link related objects without navigating the metadata tree or
searching for the object. Refer to the SuperGlue User Guide to configure the object
link.
Configuring Report Links
Report Links can be created to run reports on a particular metadata object. When
creating a report link, assign a SuperGlue report to a specific object. While creating a
report link, you can also create a run report button to run the associated report. The
run report button appears in the top right corner of the Details task area. When you create the run report button, you also have the option of applying it to all objects of
the same class. You can create a maximum of three run report buttons per object.
Customizing SuperGlue Packaged Reports, Dashboards, and Indicators
You can create new reporting elements and attributes under Schema Design. These
new elements can be used in new reports or existing report extensions. You can also
extend or customize "out-of-the-box" reports, indicators, or dashboards. Informatica
recommends using the Save As new report option for such changes in order to avoid
any conflicts during upgrades.
Further, you can create new reports using the 1-2-3-4 report creation wizard of
Informatica PowerAnalyzer. Informatica recommends saving such reports in a new
report folder to avoid conflict during upgrades.
Customizing SuperGlue ODS Reports
Use the operational data store (ODS) report templates to analyze metadata stored in a
particular repository. Although these reports can be used as is, they can also be customized to suit particular business requirements. Out-of-the-box reports can be
used as a guideline for creating reports for other types of source repositories, such as
a repository for which SuperGlue does not package an XConnect.











Estimating SuperGlue Volume Requirements
Challenge
Understanding the relationship between various inputs for the SuperGlue solution so as
to be able to estimate volumes for the SuperGlue Warehouse.
Description
The size of the SuperGlue warehouse is directly proportional to the size of the metadata being loaded into it. The size also depends on the number of element attributes captured in the source metadata and the associations defined in the metamodel.
When estimating volume requirements for a SuperGlue implementation, consider the
following SuperGlue components:
SuperGlue Server
SuperGlue Console
SuperGlue Integration Repository
SuperGlue Warehouse
NOTE: Refer to the SuperGlue Installation Guide for complete information on minimum
system requirements for server, console and integration repository.
Considerations
Volume estimation for SuperGlue is an iterative process. Use the SuperGlue development environment to get an accurate size estimate for the SuperGlue production environment. The required steps are as follows:
1. Identify the source metadata that needs to be loaded in the SuperGlue
Production warehouse.
2. Size the SuperGlue Development warehouse based on the initial sizing estimates
(as explained in the next section of this document).
3. Run the XConnects and monitor the disk usage. If the XConnect run fails due to
insufficient space, add more space in line with the initial sizing estimate recommendations.
4. Restart the XConnect.
Repeat steps 1 through 4 until the XConnect run is successful.

Following are the initial sizing estimates for a typical SuperGlue implementation:
SuperGlue Server


SuperGlue Console


SuperGlue Integration Repository



SuperGlue Warehouse

The following table is an initial estimation matrix that should be helpful in deriving a reasonable initial estimate. For larger input sizes, expect the SuperGlue Warehouse target size to increase in direct proportion.

XConnect                     Input Size   Expected SuperGlue Warehouse Target Size
Metamodel and other tables   -            50MB
PowerCenter                  1MB          10MB
PowerAnalyzer                1MB          4MB
Database                     1MB          5MB
Other XConnect               1MB          4.5MB
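As a purely illustrative calculation using the ratios above (the input volumes are assumptions, not measurements): for roughly 20MB of PowerCenter repository metadata and 10MB of database catalog metadata, the initial estimate would be approximately 50MB (metamodel and other tables) + 20 x 10MB + 10 x 5MB = 300MB, plus headroom for growth between loads.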











SuperGlue Metadata Load Validation
Challenge
Just as knowing that all data for the current load cycle has loaded correctly is essential for good data warehouse management, it is equally important to validate that all metadata extractions (XConnects) loaded correctly into the SuperGlue warehouse. If metadata extractions do not execute successfully, the SuperGlue warehouse will not contain the most up-to-date metadata.
Description
The process for validating the SuperGlue metadata loads is very simple using the
SuperGlue Configuration Console. In the SuperGlue Configuration Console, you can
view the run history for each of the XConnects. For those who are familiar with
PowerCenter, the Run History portion of the SuperGlue Configuration Console is
similar to the Workflow Monitor in PowerCenter.
To view XConnect run history, first log into the SuperGlue Configuration Console.

After logging into the console, click XConnects > Execute Now (or click on the Execute
Now shortcut on the left navigation panel).


The XConnect run history is displayed on the Execute Now screen. A SuperGlue Administrator should log into the SuperGlue Configuration Console on a regular basis and verify that all XConnects that were scheduled ran to successful completion.


If any XConnects have a status of Failure, the issue should be investigated and corrected, and the XConnect should be re-executed. XConnects can fail for a variety of common reasons, such as database unavailability, network failure, or improper configuration.
More detailed error messages can be found in the event log or in the workflow log files.
By clicking on the Schedule shortcut on the left navigation pane in the SuperGlue
Configuration Console, you can view the logging options that are set up for the
XConnect. In most cases, logging is set up to write to the <SUPERGLUE_HOME>/Console/SuperGlue_Log.log file.
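If you prefer to scan the log from a command prompt rather than opening the file, a minimal sketch follows; the installation path is a placeholder for your <SUPERGLUE_HOME> directory on the (typically Windows) console host.

    REM Hypothetical example: list console log lines that mention errors.
    findstr /i "error" "C:\Informatica\SuperGlue\Console\SuperGlue_Log.log"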


After investigating and correcting the issue, the XConnect that failed should be re-
executed at the next available time in order to load the most recent metadata.











Using SuperGlue Console to Tune the XConnects
Challenge
Improving the efficiency and reducing the run-time of your XConnects through the
parameter settings of the SuperGlue console.
Description
Remember that the minimum system requirements for a machine hosting the
SuperGlue console are:
Windows operating system (2000, NT 4.0 SP 6a)
400MB disk space
128MB RAM (256MB recommended)
133 MHz processor.
If the system meets or exceeds the minimal requirements, but an XConnect is still
taking an inordinately long time to run, use the following steps to try to improve its
performance.
To improve performance of your XConnect loads from database catalogs:
Modify the inclusion/exclusion schema list (if more schemas are to be loaded than excluded, use the exclusion list).
Carefully examine how far back the project really needs to look for old objects by default. Modify the sysdate - 5000 filter to a smaller value to reduce the result set (an illustrative filter is shown below).
To improve performance of your XConnect loads from the PowerCenter repository:
Load only the production folders that are needed for a particular project.
Run the XConnects with just one folder at a time, or select the list of folders for a
particular run.
