Published by Amjed Khan on Jan 10, 2011

  • Migration Procedures
  • Data Connectivity Using PowerConnect for BW Integration Server
  • Data Connectivity using PowerConnect for Mainframe
  • Data Connectivity using PowerConnect for MQSeries
  • Data Connectivity using PowerConnect for PeopleSoft
  • Data Connectivity using PowerConnect for SAP
  • Metadata Reporting and Sharing
  • Session and Data Partitioning
  • Using Parameters, Variables and Parameter Files
  • A Mapping Approach to Trapping Data Errors
  • Design Error Handling Infrastructure
  • Documenting Mappings Using Repository Reports
  • Error Handling Strategies
  • Using Shortcut Keys in PowerCenter Designer
  • Creating Inventories of Reusable Objects & Mappings
  • Updating Repository Statistics
  • Third Party Scheduler
  • Event Based Scheduling
  • Repository Administration
  • Recommended Performance Tuning Procedures
  • Performance Tuning Databases
  • Performance Tuning UNIX Systems
  • Performance Tuning Windows NT/2000 Systems
  • Tuning Mappings for Better Performance
  • Tuning Sessions for Better Performance
  • Determining Bottlenecks
  • Advanced Client Configuration Options
  • Advanced Server Configuration Options
  • Running Sessions in Recovery Mode
  • Developing the Business Case
  • Assessing the Business Case
  • Defining and Prioritizing Requirements
  • Developing and Maintaining the Project Plan
  • Managing the Project Lifecycle


The objective is to develop a migration strategy that ensures a clean migration between development, test, QA, and production, thereby protecting the integrity of each of these environments as the system evolves.


In every application deployment, a migration strategy must be formulated to ensure
a clean migration between development, test, quality assurance, and production. The
migration strategy is largely influenced by the technologies that are deployed to
support the development and production environments. These technologies include
the databases, the operating systems, and the available hardware.

Informatica offers flexible migration techniques that can be adapted to fit the
existing technology and architecture of various sites, rather than proposing a single
fixed migration strategy. The means to migrate work from development to
production depends largely on the repository environment, which is either:

• Standalone PowerCenter, or
• Distributed PowerCenter

This Best Practice describes several migration strategies, outlining the advantages and disadvantages of each. It also discusses an XML method provided in PowerCenter 5.1 to support migration in either a Standalone or a Distributed environment.

Standalone PowerMart/PowerCenter

In a standalone environment, all work is performed in a single Informatica repository
that serves as the shared metadata store. In this standalone environment,
segregating the workspaces ensures that the migration from development to
production is seamless.

Workspace segregation can be achieved by creating separate folders for each work
area. For instance, we might build a single data mart for the finance division within a corporation. In this example, we would create a minimum of four folders to manage our metadata: FINANCE_DEV, FINANCE_TEST, FINANCE_QA, and FINANCE_PROD.

In this scenario, mappings are developed in the FINANCE_DEV folder. As
development is completed on particular mappings, they will be copied one at a time
to the FINANCE_TEST folder. New sessions will be created or copied for each
mapping in the FINANCE_TEST folder.

When unit testing has been completed successfully, the mappings are copied into the
FINANCE_QA folder. This process continues until the mappings are integrated into
the production schedule. At that point, new sessions will be created in the
FINANCE_PROD folder, with the database connections adjusted to point to the
production environment.

Introducing shortcuts in a single standalone environment complicates the migration
process, but offers an efficient method for centrally managing sources and targets.

A common folder can be used for sharing reusable objects such as shared sources, target definitions, and reusable transformations. If a common folder is used, there should be one common folder for each environment (i.e., SHARED_DEV, SHARED_TEST, SHARED_QA, and SHARED_PROD).
Migration Example Process

Copying the mappings into the next stage enables the user to promote the desired
mapping to test, QA, or production at the lowest level of granularity. If the folder
where the mapping is to be copied does not contain the referenced source/target
tables or transformations, then these objects will automatically be copied along with
the mapping. The advantage of this promotion strategy is that individual mappings
can be promoted as soon as they are ready for production. However, because only
one mapping at a time can be copied, promoting a large number of mappings into
production would be very time consuming. Additional time is required to re-create or
copy all sessions from scratch, especially if pre- or post-session scripts are used.

On the initial move to production, if all mappings are completed, the entire
FINANCE_QA folder could be copied and renamed to FINANCE_PROD. With this
approach, it is not necessary to promote all mappings and sessions individually. After
the initial migration, however, mappings will be promoted on a “case-by-case” basis.




Follow these steps to copy a mapping from Development to Test:

1. If using shortcuts, first follow these substeps; if not using shortcuts, skip to step 2:

• Create four common folders, one for each migration stage (COMMON_DEV, COMMON_TEST, COMMON_QA, and COMMON_PROD).
• Copy the shortcut objects into the COMMON_TEST folder.

2. Copy the mapping from Development into Test.

• In the PowerCenter Designer, open the appropriate test folder, and drag and
drop the mapping from the development folder into the test folder.

3. If using shortcuts, follow these substeps; if not using shortcuts, skip to step 4:

• Open the mapping that uses shortcuts.
• In the Designer, bring the newly copied shortcut into the newly copied mapping.
• Using the old shortcut as a model, link all of the input ports to the new shortcut.
• Using the old shortcut as a model, link all of the output ports to the new shortcut.

If any of the objects are active, first delete the old shortcut before linking the output ports.

4. Create or copy a session in the Server Manager to run the mapping (make sure the mapping exists in the current repository first).

• If copying the mapping, follow the Copy Session Wizard.
• If creating a session, enter all the appropriate information in the session properties.


5. Implement appropriate security, such as:

• In Development, the owner of the folders should be a user in the Development group.
• In Test and Quality Assurance, change the owner of the Test/QA folders to a user in the Test/QA group.
• In Production, change the owner of the folders to a user in the Production group.
• Revoke all rights to Public other than Read for the Production folders.

Performance Implications in the Single Environment

A disadvantage of the single environment approach is that even though the Development, Test, QA, and Production “environments” are stored in separate folders, they all reside on the same server. This can have negative performance implications. If Development or Test loads run simultaneously with Production loads, the server machine may reach 100 percent utilization and Production performance will suffer.

Often, Production loads run late at night while most Development and Test loads run during the day, so this does not pose a problem. However, situations do arise where performance benchmarking with large volumes or other unusual circumstances can cause test loads to run overnight, contending with the pre-scheduled Production loads.
Distributed PowerCenter

In a distributed environment, there are separate, independent environments (i.e.,
hardware and software) for Development, Test, QA, and Production. This is the
preferred method for handling Development to Production migrations. Because each
environment is segregated from the others, work performed in Development cannot
impact Test, QA, or Production.

With a fully distributed approach, separate repositories provide the same function as the separate folders in the standalone environment described previously. Each repository contains folders named similarly to those in the standalone environment. For instance, in our Finance example we would have four repositories: FINANCE_DEV, FINANCE_TEST, FINANCE_QA, and FINANCE_PROD.

The mappings are created in the Development repository, moved into the Test
repository, and then eventually into the Production environment. There are three
main techniques to migrate from Development to Production, each involving some
advantages and disadvantages:

• Repository Copy
• Folder Copy
• Object Copy

Repository Copy

The main advantage to this approach is the ability to copy everything at once from
one environment to another, including source and target tables, transformations,
mappings, and sessions. Another advantage is the ability to automate this process
without having users perform this process. The final advantage is that everything
can be moved without breaking/corrupting any of the objects.

There are, however, three distinct disadvantages to the repository copy method. The first is that everything is moved at once (also an advantage): everything is moved, ready or not. For example, there may be 50 mappings in QA but only 40 of them are production-ready; the 10 unready mappings are moved into production along with the 40 production-ready ones. This leads to the second disadvantage: maintenance is required to remove any unwanted or excess objects. The third disadvantage is the need to adjust server variables, sequences, parameters/variables, database connections, and so on. Everything must be set up correctly on the new server that will host the repository.

There are three ways to accomplish the Repository Copy method:




• Copying the Repository
• Repository Backup and Restore
• Using the PMREP command line utility

Copying the Repository

The repository copy command is probably the easiest method of migration. To perform it, go to the File menu of the Repository Manager and select Copy Repository. A dialog box then prompts the user for the location to which the repository will be copied.

To successfully perform the copy, the user must delete the current repository in the
new location. For example, if a user was copying a repository from DEV to TEST,
then the TEST repository must first be deleted using the Delete option in the
Repository Manager to create room for the new repository. Then the Copy Repository
routine must be run.

Repository Backup and Restore

The Backup and Restore Repository is another simple method of copying an entire
repository. To perform this function, go to the File menu in the Repository Manager
and select Backup Repository. This will create a .REP file containing all repository
information. To restore the repository simply open the Repository Manager on the
destination server and select Restore Repository from the File menu. Select the
created .REP file to automatically restore the repository in the destination server. To
ensure success, be sure to first delete any matching destination repositories, since
the Restore Repository option does not delete the current repository.


PMREP

Using the PMREP commands is essentially the same as the Backup and Restore Repository method, except that it is run from the command line. The PMREP utilities can be used both from the Informatica Server and from any client machine connected to the server.




PMREP provides commands to connect to a repository and to back it up and restore it from the command line.

Command syntax like this can be placed in a batch file or shell script and scheduled to run on a daily basis to perform functions such as connect, backup, and restore.
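A minimal sketch of such a script follows. The flag names used here (-r, -n, -x, -o) are assumptions for illustration only; verify the exact pmrep syntax against the command reference for your installation.

```shell
#!/bin/sh
# Nightly repository backup via pmrep -- a sketch, not verified syntax.
# PMREP is overridable (e.g., for testing); flag names are assumptions.
PMREP=${PMREP:-pmrep}

backup_repository() {
    rep_name=$1; rep_user=$2; rep_pass=$3; out_file=$4
    # Connect to the repository, then write a .REP backup file.
    $PMREP connect -r "$rep_name" -n "$rep_user" -x "$rep_pass" &&
    $PMREP backup -o "$out_file"
}

# Example (hypothetical names), suitable for a daily cron entry:
# backup_repository FINANCE_DEV Administrator secret /backups/finance_dev.rep
```

A wrapper function like this keeps the connection and backup steps in one place, so the same script can be reused for each repository by changing only the arguments.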

After following one of the above procedures to migrate into Production, follow these
steps to convert the repository to Production:

1. Disable sessions that schedule mappings that are not ready for Production, or simply delete the mappings and sessions.

• Disable the sessions in the Server Manager by opening the session properties and clearing the Enable checkbox under the General tab.
• Delete the sessions in the Server Manager and the mappings in the Designer.




2. Modify the database connection strings to point to the Production sources and targets.
• In the Server Manager, select Database Connections from the Server
Configuration menu.
• Edit each database connection by changing the connect string to point to the
production sources and targets.
• If using lookup transformations in the mappings and the connect string is
anything other than $SOURCE or $TARGET, then the connect string will need
to be modified appropriately.

3. Modify the pre- and post-session commands as necessary.

• In the Server Manager, open the session properties, and from the General tab
make the required changes to the pre- and post-session scripts.

4. Implement appropriate security, such as:

• In Development, ensure that the owner of the folders is a user in the Development group.
• In Test and Quality Assurance, change the owner of the Test/QA folders to a user in the Test/QA group.
• In Production, change the owner of the folders to a user in the Production group.
• Revoke all rights to Public other than Read for the Production folders.

Folder Copy

Copying an entire folder allows you to quickly promote all of the objects in the
Development folder to Test, and so forth. All source and target tables, reusable
transformations, mappings, and sessions are promoted at once. Therefore,
everything in the folder must be ready to migrate forward. If certain mappings are
not ready, then after the folder is copied, developers (or the Repository
Administrator) must manually delete these mappings from the new folder.

The advantages of Folder Copy are:

• Easy to move the entire folder and all objects in it
• Detailed Wizard guides the user through the entire process
• There’s no need to update or alter any Database Connections, sequences or
server variables.

The disadvantages of Folder Copy are:

• User needs to be logged into multiple environments simultaneously.
• The repository is locked while Folder Copy is being performed.

If copying a folder, for example, from QA to Production, follow these steps:

1. If using shortcuts, follow these substeps; otherwise skip to step 2:




• In each of the dedicated repositories, create a common folder using exactly
the same name and case as in the “source” repository.
• Copy the shortcut objects into the common folder in Production and make
sure the shortcut has exactly the same name.
• Open and connect to either the Repository Manager or Designer.

2. Drag and drop the folder onto the production repository icon within the
Navigator tree structure. (To copy the entire folder, drag and drop the folder icon
just under the repository level.)

3. Follow the Copy Folder Wizard steps. If a folder with that name already exists, it must be renamed.

4. Point the folder to the correct shared folder, if one is being used.




After performing the Folder Copy method, be sure to remember the following steps:

1. Modify the pre- and post-session commands as necessary:

• In the Server Manager, open the session properties, and from the General tab make the required changes to the pre- and post-session scripts.

2. Implement appropriate security:

• In Development, ensure the owner of the folders is a user in the Development group.
• In Test and Quality Assurance, change the owner of the Test/QA folders to a user in the Test/QA group.
• In Production, change the owner of the folders to a user in the Production group.
• Revoke all rights to Public other than Read for the Production folders.

Object Copy

Copying mappings into the next stage within a networked environment has many of
the same advantages and disadvantages as in the standalone environment, but the
process of handling shortcuts is simplified in the networked environment. For
additional information, see the previous description of Object Copy for the
standalone environment.

Additional advantages and disadvantages of Object Copy in a distributed environment include:

Advantages:

• More granular control over objects

Disadvantages:

• Much more work to deploy an entire group of objects
• Shortcuts must exist prior to importing/copying mappings

Follow these steps to copy a mapping from QA into Production:

1. If using shortcuts, follow these substeps; otherwise skip to step 2:

• In each of the dedicated repositories, create a common folder with the exact
same name and case.
• Copy the shortcuts into the common folder in Production making sure the
shortcut has the exact same name.

2. Copy the mapping from quality assurance (QA) into production.

• In the Designer, connect to both the QA and Production repositories and open
the appropriate folders in each.
• Drag and drop the mapping from QA into Production.

3. Create or copy a session in the Server Manager to run the mapping (make sure the mapping exists in the current repository first).

• If copying the mapping, follow the Copy Session Wizard.
• If creating a session, enter all the appropriate information in the session properties.




4. Implement appropriate security.

• In Development, ensure the owner of the folders is a user in the Development group.
• In Test and Quality Assurance, change the owner of the Test/QA folders to a user in the Test/QA group.
• In Production, change the owner of the folders to a user in the Production group.
• Revoke all rights to Public other than Read for the Production folders.


Informatica recommends using the following process when running in a three-tiered
environment with Development, Test/QA, and Production servers:

For migrating from Development into Test, Informatica recommends using the
Object Copy method. This method gives you total granular control over the objects
that are being moved. It ensures that the latest development maps can be moved
over manually as they are completed. For recommendations on performing this copy
procedure correctly, see the steps outlined in the Object Copy section.




When migrating from Test to Production, Informatica recommends using the Repository Copy method. Before performing this migration, all code on the Test server should be frozen and tested. After the Test code is cleared for production, use one of the repository copy methods (refer to the steps outlined in the Repository Copy section to ensure that this process is successful). If similar server and database naming conventions are used, minimal or no changes will be required to sessions that are created or copied to the production server.

XML Object Copy Process

Another method of copying objects in a distributed (or centralized) environment is to
copy objects by utilizing PM/PC’s XML functionality. This method is more useful in the
distributed environment because it allows for backup into an XML file to be moved
across the network.

The XML Object Copy Process works in a manner very similar to the Repository Copy backup and restore method, as it allows you to copy sources, targets, reusable transformations, mappings, and sessions. Once the XML file has been created, it can be edited with a text editor for more flexibility. For example, to copy one session many times, you would export that session to an XML file, find everything within that session's tag, copy that text, and paste the copy back into the XML file, then change the name of the pasted session so that it is unique. When you import the XML file back into your folder, two sessions are created. The following demonstrates the import/export functionality:

1. Objects are exported into an XML file.

2. Objects are imported into a repository from the corresponding XML file.

3. Sessions can be exported and imported into the Server Manager in the same way (the corresponding mappings must exist for this to work).
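The session-duplication edit described above can also be scripted. This is only a sketch: the SESSION element and NAME attribute used here are assumptions about the export format, so inspect a real export file before relying on it.

```shell
#!/bin/sh
# Duplicate one session inside an exported repository XML file.
# Assumes (hypothetically) sessions look like: <SESSION NAME="...">...</SESSION>
duplicate_session() {
    src_file=$1; old_name=$2; new_name=$3
    # Pull out the session element by name.
    block=$(sed -n "/<SESSION NAME=\"$old_name\">/,/<\/SESSION>/p" "$src_file")
    # Rename the copy so the import creates a second, unique session.
    renamed=$(printf '%s\n' "$block" | sed "s/NAME=\"$old_name\"/NAME=\"$new_name\"/")
    # Emit the file with the renamed copy inserted before the closing root tag.
    awk -v copy="$renamed" '/<\/REPOSITORY>/ { print copy } { print }' "$src_file"
}
```

Importing the resulting file back into the folder would then create both sessions.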







Development FAQs


This Best Practice addresses how to use the PowerCenter product suite most effectively to develop, name, and document components of the analytic solution. While the most effective use of PowerCenter depends on the specific situation, the questions addressed here are commonly raised by project teams. Answers are provided in a number of areas, including Scheduling, Backup Strategies, Server Administration, and Metadata. Refer to the product guides supplied with PowerCenter for additional information.


The following pages summarize some of the questions that typically arise during
development and suggest potential resolutions.

Q: How does source format affect performance? (i.e., is it more efficient to source
from a flat file rather than a database?)

In general, a flat file that is located on the server machine loads faster than a database located on the server machine. Fixed-width files are faster than delimited files because delimited files require extra parsing. However, if there is an intent to perform intricate transformations before loading to the target, it may be advisable to first load the flat file into a relational database, which allows the PowerCenter mappings to access the data in an optimized fashion by using filters and custom SQL SELECTs where appropriate.

Q: What are some considerations when designing the mapping? (i.e. what is the
impact of having multiple targets populated by a single map?)

With PowerCenter, it is possible to design a mapping with multiple targets.
You can then load the targets in a specific order using Target Load Ordering.
The recommendation is to limit the amount of complex logic in a mapping.
Not only is it easier to debug a mapping with a limited number of objects, but
they can also be run concurrently and make use of more system resources.
When using multiple output files (targets), consider writing to multiple disks or file systems simultaneously. This minimizes disk seeks and applies both to a session writing to multiple targets and to multiple sessions running concurrently.
Q: What are some considerations for determining how many objects and
transformations to include in a single mapping?

There are several items to consider when building a mapping. The business
requirement is always the first consideration, regardless of the number of
objects it takes to fulfill the requirement. The most expensive use of the DTM
is passing unnecessary data through the mapping. It is best to use filters as
early as possible in the mapping to remove rows of data that are not needed.
This is the SQL equivalent of the WHERE clause. Using the filter condition in
the Source Qualifier to filter out the rows at the database level is a good way
to increase the performance of the mapping.

Log File Organization

Q: Where is the best place to maintain Session Logs?

One often-recommended location is the default /SessLogs/ folder in the
Informatica directory, keeping all log files in the same directory.

Q: What documentation is available for the error codes that appear within the error
log files?

Log file errors and descriptions appear in Appendix C of the PowerCenter
User Guide. Error information also appears in the PowerCenter Help File
within the PowerCenter client applications. For other database-specific errors,
consult your Database User Guide.

Scheduling Techniques

Q: What are the benefits of using batches rather than sessions?

Using a batch to group logical sessions minimizes the number of objects that
must be managed to successfully load the warehouse. For example, a
hundred individual sessions can be logically grouped into twenty batches. The
Operations group can then work with twenty batches to load the warehouse,
which simplifies the operations tasks associated with loading the targets.

There are two types of batches: sequential and concurrent.

o A sequential batch simply runs sessions one at a time, in a linear
sequence. Sequential batches help ensure that dependencies are met
as needed. For example, a sequential batch ensures that session1 runs
before session2 when session2 is dependent on the load of session1,
and so on. It's also possible to set up conditions to run the next
session only if the previous session was successful, or to stop on
errors, etc.




o A concurrent batch groups logical sessions together, like a sequential
batch, but runs all the sessions at one time. This can reduce the load
times into the warehouse, taking advantage of hardware platforms'
Symmetric Multi-Processing (SMP) architecture. A new batch is
sequential by default; to make it concurrent, explicitly select the
Concurrent check box.

Other batch options, such as nesting batches within batches, can further reduce the complexity of loading the warehouse. This capability allows for the creation of very complex and flexible batch streams without the use of a third-party scheduler.

Q: Assuming a batch failure, does PowerCenter allow restart from the point of failure?
Yes. When a session or sessions in a batch fail, you can perform recovery to
complete the batch. The steps to take vary depending on the type of batch:

If the batch is sequential, you can recover data from the session that failed
and run the remaining sessions in the batch. If a session within a concurrent
batch fails, but the rest of the sessions complete successfully, you can
recover data from the failed session targets to complete the batch. However,
if all sessions in a concurrent batch fail, you might want to truncate all targets
and run the batch again.

Q: What guidelines exist regarding the execution of multiple concurrent sessions /
batches within or across applications?

Session/Batch Execution needs to be planned around two main constraints:

• Available system resources
• Memory and processors

The number of sessions that can run at one time depends on the number of
processors available on the server. The load manager is always running as a
process. As a general rule, a session will be compute-bound, meaning its
throughput is limited by the availability of CPU cycles. Most sessions are
transformation intensive, so the DTM always runs. Also, some sessions
require more I/O, so they use less processor time. Generally, a session needs
about 120 percent of a processor for the DTM, reader, and writer in total.

For concurrent sessions:

• One session per processor is about right; you can run more, but all sessions will slow slightly.
• Remember that other processes may also run on the PowerCenter server machine; overloading a production machine will slow overall performance.
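The 120-percent rule of thumb above can be turned into a quick back-of-the-envelope calculation. This is a sketch for estimation only, not a sizing tool:

```shell
# Estimate how many concurrent sessions a server can host, using the
# rough rule that each session needs about 1.2 processors in total.
estimate_concurrent_sessions() {
    cpus=$1
    # Integer arithmetic: cpus / 1.2, expressed as cpus * 100 / 120.
    echo $(( cpus * 100 / 120 ))
}

# Example: an 8-CPU server gives room for about 6 concurrent sessions.
# estimate_concurrent_sessions 8
```

Any estimate like this should then be reduced to leave headroom for the database engine and operating system, as discussed below.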

Even after available processors are determined, it is necessary to look at
overall system resource usage. Determining memory usage is more difficult




than the processors calculation; it tends to vary according to system load and
number of Informatica sessions running. The first step is to estimate memory
usage, accounting for:

• Operating system kernel and miscellaneous processes

• Database engine

• Informatica Load Manager

Each session creates three processes: the Reader, Writer, and DTM.

• If multiple sessions run concurrently, each has three processes

• More memory is allocated for lookups, aggregates, ranks, and
heterogeneous joins in addition to the shared memory segment.

At this point, you should have a good idea of what is left for concurrent
sessions. It is important to arrange the production run to maximize use of this
memory. Remember to account for sessions with large memory
requirements; you may be able to run only one large session, or several small
sessions concurrently.

Load Order Dependencies are also an important consideration because they
often create additional constraints. For example, load the dimensions first,
then facts. Also, some sources may only be available at specific times, some
network links may become saturated if overloaded, and some target tables
may need to be available to end users earlier than others.

Q: Is it possible to perform two "levels" of event notification? One at the application level, and another at the PowerCenter server level to notify the Server Administrator?
The application level of event notification can be accomplished through post-
session e-mail. Post-session e-mail allows you to create two different
messages, one to be sent upon successful completion of the session, the
other to be sent if the session fails. Messages can be a simple notification of
session completion or failure, or a more complex notification containing
specifics about the session. You can use the following variables in the text of
your post-session e-mail:

E-mail Variable   Description

%s                Session name
%l                Total records loaded
%r                Total records rejected
%e                Session status
%t                Table details, including read throughput in bytes/second and write throughput in rows/second
%b                Session start time
%c                Session completion time
%i                Session elapsed time (session completion time - session start time)
%g                Attaches the session log to the message
%a<filename>      Attaches the named file. The file must be local to the Informatica Server.

On Windows NT, you can attach a file of any type. On UNIX, you can only attach text files; if you attach a non-text file, the send might fail.

Note: The filename cannot include the greater-than character (>) or a line break.
The PowerCenter Server on UNIX uses rmail to send post-session e-mail. The
repository user who starts the PowerCenter server must have the rmail tool
installed in the path in order to send e-mail.

To verify the rmail tool is accessible:

1. Login to the UNIX system as the PowerCenter user who starts the
PowerCenter Server.

2. Type rmail at the prompt and press Enter.

3. Type . to indicate the end of the message and press Enter.

4. You should receive a blank e-mail from the PowerCenter user's e-mail
account. If not, locate the directory where rmail resides and add that
directory to the path.

5. When you have verified that rmail is installed correctly, you are ready to
send post-session e-mail.
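The PATH portion of the check above can be automated with a small helper. This is a sketch that only verifies rmail is findable on the PATH, not that delivery actually works:

```shell
# Report whether the rmail tool is on the current PATH.
check_rmail() {
    if command -v rmail >/dev/null 2>&1; then
        echo "rmail found: $(command -v rmail)"
    else
        echo "rmail missing: locate it and add its directory to PATH"
    fi
}
```

Run check_rmail as the user who starts the PowerCenter Server, since that is the environment in which post-session e-mail is sent.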

The output should look like the following:

Session complete.
Session name: sInstrTest
Total Rows Loaded = 1
Total Rows Rejected = 0
[table details: Table Name, Read Throughput, Write Throughput]
No errors encountered.
Start Time: Tue Sep 14 12:26:31 1999
Completion Time: Tue Sep 14 12:26:41 1999
Elapsed time: 0:00:10 (h:m:s)

This information, or a subset, can also be sent to any text pager that accepts e-mail.
Backup Strategy Recommendation

Q: Can individual objects within a repository be restored from the back-up or from a
prior version?

At the present time, individual objects cannot be restored from a back-up using the PowerCenter Server Manager (i.e., you can only restore the entire repository). However, it is possible to restore the back-up repository into a different database and then manually copy the individual objects back into the main repository.

Refer to Migration Procedures for details on promoting new or changed
objects between development, test, QA, and production environments.

Server Administration

Q: What built-in functions does PowerCenter provide to notify someone in the event that the server goes down, or some other significant event occurs?

There are no built-in functions in the server to send notification if the
server goes down. However, it is possible to implement a shell script
that will sense whether the server is running or not. For example, the
command "pmcmd pingserver" will give a return code or status which
will tell you if the server is up and running. Using the results of this
command as a basis, a complex notification script could be built.
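A shell sketch of such a notification script follows. The pmcmd pingserver arguments and the notification channel are assumptions here; check pmcmd's usage on your installation before relying on it.

```shell
#!/bin/sh
# Notify an administrator when "pmcmd pingserver" stops answering.
# PMCMD is overridable so the logic can be exercised without a live server.
PMCMD=${PMCMD:-pmcmd}

notify_if_down() {
    admin=$1
    if $PMCMD pingserver >/dev/null 2>&1; then
        echo "PowerCenter server is up"
    else
        echo "PowerCenter server is down"
        # Hypothetical notification channel; substitute your own:
        # mail -s "PowerCenter server down" "$admin" < /dev/null
    fi
}
```

A cron entry running this every few minutes gives a simple availability monitor without a third-party tool.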

Q: What system resources should be monitored? What should be considered normal
or acceptable server performance levels?

The pmprocs utility, which is available for UNIX systems only, shows
the currently executing PowerCenter processes.

Pmprocs is a script that combines the ps and ipcs commands. It is
available through Informatica Technical Support. The utility provides
the following information:
- CPID - Creator PID (process ID)
- LPID - Last PID that accessed the resource
- Semaphores - used to sync the reader and writer
- 0 or 1 - shows slot in LM shared memory

(See Chapter 16 in the PowerCenter Administrator's Guide for additional details.)
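pmprocs itself is available only from Informatica Technical Support, but since it is described as a combination of ps and ipcs, a rough stand-in can be sketched; the output layout here is an assumption, not pmprocs's actual format:

```shell
# Sketch: crude approximation of pmprocs, listing PowerCenter processes
# alongside the IPC resources (semaphores, shared memory) they use.
pm_snapshot() {
    echo "--- pmserver processes ---"
    ps -ef | grep '[p]mserver' || echo "no pmserver process found"
    echo "--- semaphores (reader/writer sync) ---"
    ipcs -s 2>/dev/null || echo "ipcs -s unavailable"
    echo "--- shared memory (LM slots) ---"
    ipcs -m 2>/dev/null || echo "ipcs -m unavailable"
}
```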

Q: What cleanup (if any) should be performed after a UNIX server crash? Or after an
Oracle instance crash?

If the UNIX server crashes, first check whether the repository database is able to come back up successfully. If so, try to start the PowerCenter server. Check the pmserver.err log to verify that the server started correctly. You can also use ps -ef | grep pmserver to see whether the server process (the Load Manager) is running.
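The checks above can be combined into a small script. A sketch; the pmserver.err path is a placeholder for wherever your server installation writes its log:

```shell
# Sketch: post-crash health check for the PowerCenter server.
PMSERVER_ERR="${PMSERVER_ERR:-/opt/informatica/pmserver.err}"  # placeholder path

pmserver_running() {
    # The [p] trick keeps grep from matching its own process entry.
    ps -ef | grep '[p]mserver' >/dev/null
}

post_crash_check() {
    if pmserver_running; then
        echo "Load Manager (pmserver) is running"
    else
        echo "pmserver is not running; last lines of the error log:"
        tail -20 "$PMSERVER_ERR" 2>/dev/null || echo "no log at $PMSERVER_ERR"
    fi
}
```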


Q: What recommendations or considerations exist as to naming standards or
repository administration for metadata that might be extracted from the
PowerCenter repository and used in others?

With PowerCenter, you can enter descriptive information for all repository objects: sources, targets, transformations, etc. The amount of metadata you enter should be determined by the business requirements. You can also drill down to the column level and describe the columns in a table if necessary. All information about column size and scale, datatypes, and primary keys is stored in the repository.

The decision on how much metadata to create is often driven by project timelines. While it may be beneficial for a developer to enter detailed descriptions of each column, expression, variable, etc., it is also very time-consuming to do so. The decision should therefore be based on how much metadata will be required by the systems that use it.

Q: What procedures exist for extracting metadata from the repository?

Informatica offers an extremely rich suite of metadata-driven tools for data
warehousing applications. All of these tools store, retrieve, and manage their
metadata in Informatica's central repository. The motivation behind the
original Metadata Exchange (MX) architecture was to provide an effective and
easy-to-use interface to the repository.

Today, Informatica and several key Business Intelligence (BI) vendors,
including Brio, Business Objects, Cognos, and MicroStrategy, are effectively
using the MX views to report and query the Informatica metadata.

Informatica does not recommend accessing the repository directly, even for
SELECT access. Rather, views have been created to provide access to the
metadata stored in the repository.







Data Cleansing


Accuracy is one of the biggest obstacles blocking the success of many data
warehousing projects. If users discover data inconsistencies, the user community
may lose faith in the entire warehouse’s data. However, it is not unusual to discover
that as many as half the records in a database contain some type of information that
is incomplete, inconsistent, or incorrect. The challenge is therefore to cleanse data
online, at the point of entry into the data warehouse or operational data store (ODS),
to ensure that the warehouse provides consistent and accurate data for business
decision making.


Informatica has several partners in the data cleansing arena. The partners
and respective tools include the following:

DataMentors - Provides tools that are run before the data extraction and load process to clean source data. Available tools are:

• DMDataFuse™ - a data cleansing and householding system with the power to accurately standardize and match data.
• DMValiData™ - an effective data analysis system that profiles and identifies inconsistencies between data and metadata.
• DMUtils - a powerful non-compiled scripting language that operates on flat ASCII or delimited files. It is primarily used as a query and reporting tool. It also provides a way to reformat and summarize files.

FirstLogic – Offers direct interfaces to PowerCenter during the extract and load process, as well as pre-extraction data cleansing tools such as DataRight and Merge/Purge. The online interface (ACE Library) integrates FirstLogic's TrueName Library and Merge/Purge Library as Transformation Components, using the Informatica External Procedures protocol. These components can thus be invoked for parsing, standardization, cleansing, enhancement, and matching of name and address information during the PowerCenter ETL stage of building a data mart or data warehouse.




Paladyne – The flagship product, Datagration, is an open, flexible data quality system that can repair any type of data (not just name and address data) by incorporating custom business rules and logic. Datagration's Data Discovery Message Gateway feature assesses data cleansing requirements using automated data discovery tools that identify data patterns. Data Discovery enables Datagration to search through a field of free-form data and re-arrange the tokens (i.e., words, data elements) into a logical order. Datagration supports relational database systems and flat files as data sources, as well as any application that runs in batch mode.

Vality – Provides a product called Integrity, which identifies business
relationships (such as households) and duplications, reveals undocumented
business practices, and discovers metadata/field content discrepancies. It
offers data analysis and investigation, conditioning, and unique probabilistic
and fuzzy matching capabilities.

Vality is in the process of developing a "TX Integration" to PowerCenter.
Delivery of this bridge was originally scheduled for May 2001, but no further
information is available at this time.

Trillium – Trillium’s eQuality customer information components (a web-enabled tool) are integrated with Informatica’s Transformation Exchange modules and reside on the same server as Informatica’s transformation engine. As a result, Informatica users can invoke Trillium’s four data quality components through an easy-to-use graphical desktop object. The four components are:

• Converter: data analysis and investigation module for discovering
word patterns and phrases within free form text
• Parser: processing engine for data cleansing, elementizing and
standardizing customer data
• Geocoder: an Internationally-certified postal and census module for
address verification and standardization
• Matcher: a module designed for relationship matching and record

Integration Examples

The following sections describe how to integrate two of these tools with PowerCenter.

FirstLogic – ACE

The following graphic illustrates a high-level flow diagram of the data cleansing process.




Use the Informatica Advanced External Transformation process to interface with the FirstLogic module by creating a “Matching Link” transformation. That process uses the Informatica Transformation Developer to create a new Advanced External Transformation, which incorporates the properties of the FirstLogic Matching Link files. Once a Matching Link transformation has been created in the Transformation Developer, users can incorporate it into any of their project mappings, where it is reusable.

When an Informatica session starts, the transformation is initialized. The
initialization sets up the address processing options, allocates memory, and
opens the files for processing. This operation is only performed once. As each
record is passed into the transformation it is parsed and standardized. Any
output components are created and passed to the next transformation. When
the session ends, the transformation is terminated. The memory is once again
available and the directory files are closed.

The available functions / processes are as follows.

ACE Processing

There are four ACE transformations to choose from. They parse, standardize, and append address components using FirstLogic’s ACE Library. The choice of transformation depends on the input record layout. A fourth transformation provides optional components; it must be attached to one of the three base transformations.

The four transforms are:

1. ACE_discrete - where the input address data is presented in discrete fields.
2. ACE_multiline - where the input address data is presented in multiple lines (1-6).
3. ACE_mixed - where the input data is presented with discrete city/state/zip and multiple address lines (1-6).
4. Optional transform - attached to one of the three base transforms; outputs the additional optional components of ACE.




All records input into the ACE transformation are returned as output. ACE returns Error/Status Code information during the processing of each address, which allows the end user to invoke additional rules before the final load is performed.
TrueName Process

TrueName mirrors the ACE transformation options with discrete, multi-line, and mixed transformations. A fourth, optional transformation can be attached to one of the three to provide genderization and match-standards enhancements. TrueName generates error and status codes. As with ACE, all records input into the TrueName transformation are returned as output.

Matching Process

The matching process works through one transformation within the Informatica architecture. The input data is read into the Informatica data flow similar to a batch file. All records are read, the break groups are created and, in the last step, matches are identified. Users set up their own matching transformation through the PowerCenter Designer by creating an advanced external procedure transformation. Users can select which records are output from the matching transformation by editing its initialization properties.

All matching routines are predefined and, if necessary, the configuration files can be accessed for additional tuning. The five predefined matching scenarios are: individual, family, household (the only difference between household and family is that household does not match on last name), firm individual, and firm. Keep in mind that the matching does not do any data parsing; this must be accomplished prior to using this transformation. As with ACE and TrueName, error and status codes are reported.


Trillium

Integration with Trillium’s data cleansing software is achieved through the Informatica Trillium Advanced External Procedures (AEP) interface.

The AEP modules incorporate the following Trillium functional components.

• Trillium Converter – The Trillium Converter facilitates data
conversion such as EBCDIC to ASCII, integer to character, character
length modification, literal constant and increasing values. It may also
be used to create unique record identifiers, omit unwanted
punctuation, or translate strings based on actual data or mask values.
A user-customizable parameter file drives the conversion process. The
Trillium Converter is a separate transformation that can be used
standalone or in conjunction with the Trillium Parser module.
• Trillium Parser – The Trillium Parser identifies and/or verifies the components of free-floating or fixed-field name and address data. The primary function of the Parser is to partition the input address records into manageable components in preparation for postal and census geocoding. The parsing process is highly table-driven to allow customization of name and address identification to specific requirements.
• Trillium Postal Geocoder – The Trillium Postal Geocoder matches an address database to the ZIP+4 database of the U.S. Postal Service.
• Trillium Census Geocoder – The Trillium Census Geocoder matches the address database to U.S. Census Bureau information.

Each record that passes through the Trillium Parser external module is first parsed and then, optionally, postal geocoded and census geocoded. The level of geocoding performed is determined by a user-definable initialization file.
• Trillium Window Matcher – The Trillium Window Matcher allows the
PowerCenter Server to invoke Trillium’s deduplication and house
holding functionality. The Window Matcher is a flexible tool designed to
compare records to determine the level of likeness between them. The
result of the comparisons is considered a passed, a suspect, or a failed
match depending upon the likeness of data elements in each record,
as well as a scoring of their exceptions.

Input to the Trillium Window Matcher transformation is typically the sorted output of the Trillium Parser transformation. The options for sorting include:

• Using the Informatica Aggregator transformation as a sort engine.
• Separating the mappings wherever a sort is required; the sort can then be run as a pre- or post-session command between mappings. Pre/post-session commands are configured in the Server Manager.
• Building a custom AEP transformation to include in the mapping.
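For the second option, the pre/post-session command can be an ordinary UNIX sort over the Parser's flat-file output. A sketch; the file names, pipe delimiter, and key positions are assumptions about your Parser output layout:

```shell
# Sketch: pre-session command sorting Parser output before the Window Matcher.
# Assumes a pipe-delimited flat file sorted on fields 3 and 4
# (hypothetical match-key columns).
sort_parser_output() {
    in="$1"
    out="$2"
    sort -t'|' -k3,3 -k4,4 "$in" > "$out"
}

# Usage as a pre-session command, e.g.:
#   sort_parser_output parsed_output.dat matcher_input.dat
```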



