
Velocity v8

Best Practices

B2B Data Exchange


  B2B Data Transformation Installation (for Unix)
  B2B Data Transformation Installation (for Windows)
  Deployment of B2B Data Transformation Services
  Establishing a B2B Data Transformation Development Architecture
  Testing B2B Data Transformation Services

Configuration Management and Security
  Configuring Security
  Data Analyzer Security
  Database Sizing
  Deployment Groups
  Migration Procedures - PowerCenter
  Migration Procedures - PowerExchange
  Running Sessions in Recovery Mode
  Using PowerCenter Labels

Data Analyzer Configuration
  Deploying Data Analyzer Objects
  Installing Data Analyzer

Data Connectivity
  Data Connectivity using PowerCenter Connect for BW Integration Server
  Data Connectivity using PowerExchange for WebSphere MQ
  Data Connectivity using PowerExchange for SAP NetWeaver
  Data Connectivity using PowerExchange for Web Services

Data Migration
  Data Migration Principles
  Data Migration Project Challenges
  Data Migration Velocity Approach

Data Quality and Profiling
  Build Data Audit/Balancing Processes
  Continuing Nature of Data Quality
  Data Cleansing
  Data Profiling
  Data Quality Mapping Rules
  Data Quality Project Estimation and Scheduling Factors
  Developing the Data Quality Business Case
  Effective Data Matching Techniques
  Effective Data Standardizing Techniques
  Integrating Data Quality Plans with PowerCenter
  Managing Internal and External Reference Data
  Real-Time Matching Using PowerCenter
  Testing Data Quality Plans
  Tuning Data Quality Plans
  Using Data Explorer for Data Discovery and Analysis
  Working with Pre-Built Plans in Data Cleanse and Match

Development Techniques
  Designing Data Integration Architectures
  Development FAQs
  Event Based Scheduling
  Key Management in Data Warehousing Solutions
  Mapping Auto-Generation
  Mapping Design
  Mapping SDK
  Mapping Templates
  Naming Conventions
  Naming Conventions - B2B Data Transformation
  Naming Conventions - Data Quality
  Performing Incremental Loads
  Real-Time Integration with PowerCenter
  Session and Data Partitioning
  Using Parameters, Variables and Parameter Files
  Using PowerCenter with UDB
  Using Shortcut Keys in PowerCenter Designer
  Working with JAVA Transformation Object

Error Handling
  Error Handling Process
  Error Handling Strategies - Data Warehousing
  Error Handling Strategies - General
  Error Handling Techniques - PowerCenter Mappings
  Error Handling Techniques - PowerCenter Workflows and Data Analyzer

Integration Competency Centers and Enterprise Architecture
  Business Case Development
  Canonical Data Modeling
  Chargeback Accounting
  Engagement Services Management
  Information Architecture
  People Resource Management
  Planning the ICC Implementation
  Proposal Writing
  Selecting the Right ICC Model

Metadata and Object Management
  Creating Inventories of Reusable Objects & Mappings
  Metadata Reporting and Sharing
  Repository Tables & Metadata Management
  Using Metadata Extensions
  Using PowerCenter Metadata Manager and Metadata Exchange Views for Quality Assurance

Metadata Manager Configuration
  Configuring Standard Metadata Resources
  Custom XConnect Implementation
  Customizing the Metadata Manager Interface
  Estimating Metadata Manager Volume Requirements
  Metadata Manager Business Glossary
  Metadata Manager Load Validation
  Metadata Manager Migration Procedures
  Metadata Manager Repository Administration
  Upgrading Metadata Manager

Operations
  Daily Operations
  Data Integration Load Traceability
  Disaster Recovery Planning with PowerCenter HA Option
  High Availability
  Load Validation
  Repository Administration
  Third Party Scheduler
  Updating Repository Statistics

Performance and Tuning
  Determining Bottlenecks
  Performance Tuning Databases (Oracle)
  Performance Tuning Databases (SQL Server)
  Performance Tuning Databases (Teradata)
  Performance Tuning in a Real-Time Environment
  Performance Tuning UNIX Systems
  Performance Tuning Windows 2000/2003 Systems
  Recommended Performance Tuning Procedures
  Tuning and Configuring Data Analyzer and Data Analyzer Reports
  Tuning Mappings for Better Performance
  Tuning Sessions for Better Performance
  Tuning SQL Overrides and Environment for Better Performance
  Using Metadata Manager Console to Tune the XConnects

PowerCenter Configuration
  Advanced Client Configuration Options
  Advanced Server Configuration Options
  Causes and Analysis of UNIX Core Files
  Domain Configuration
  Managing Repository Size
  Organizing and Maintaining Parameter Files & Variables
  Platform Sizing
  PowerCenter Admin Console
  PowerCenter Enterprise Grid Option
  Understanding and Setting UNIX Resources for PowerCenter Installations

PowerExchange Configuration
  PowerExchange for Oracle CDC
  PowerExchange for SQL Server CDC
  PowerExchange Installation (for AS/400)
  PowerExchange Installation (for Mainframe)

Project Management
  Assessing the Business Case
  Defining and Prioritizing Requirements
  Developing a Work Breakdown Structure (WBS)
  Developing and Maintaining the Project Plan
  Developing the Business Case
  Managing the Project Lifecycle
  Using Interviews to Determine Corporate Data Integration Requirements

Upgrades
  Upgrading Data Analyzer
  Upgrading PowerCenter
  Upgrading PowerExchange

INFORMATICA CONFIDENTIAL

B2B Data Transformation Installation (for Unix)

Challenge


Install and configure B2B Data Transformation on new or existing hardware, either in conjunction with PowerCenter or co-existing with other host applications on the same application server. Note: B2B Data Transformation (B2BDT) was formerly called Complex Data Exchange (CDE); all references to CDE in this document now refer to B2BDT.

Description
Consider the following questions when determining what type of hardware to use for B2BDT. If the hardware already exists:

1. Are the processor and operating system supported by B2BDT?
2. Are the necessary operating system version and patches applied?
3. How many CPUs does the machine currently have? Can the CPU capacity be expanded?
4. How much memory does the machine have? How much is available to the B2BDT application?
5. Will B2BDT share the machine with other applications? If yes, what are the CPU and memory requirements of the other applications?

If the hardware does not already exist:

1. Has the organization standardized on a hardware or operating system vendor?
2. What type of operating system is preferred and supported?

Regardless of the hardware vendor chosen, the hardware must be configured and sized appropriately to support the complex data transformation requirements for B2BDT. The hardware requirements for the B2BDT environment depend upon the data volumes, number of concurrent users, application server and operating system used, among other factors. For exact sizing recommendations, contact Informatica Professional Services for a B2BDT Sizing and Baseline Architecture engagement.

Planning for B2BDT Installation


There are several variations of the hosting environment from which B2BDT services will be called, with implications for how B2BDT is installed and configured.

Host Software Environment


The most common configurations are:
- B2BDT used in conjunction with PowerCenter
- B2BDT as a standalone configuration
- B2BDT in conjunction with a non-PowerCenter integration, using an adapter for other middleware software such as WebMethods

In addition, B2BDT 4.4 included a mechanism for exposing B2BDT services through web services so that they could be called from applications capable of calling web services. Depending on what host options are chosen, installation may vary.

Installation of B2BDT for a PowerCenter Host Environment



Be sure to have the necessary licenses and the additional plug-in required for PowerCenter integration. Refer to the appropriate installation guide or contact Informatica support for details on installing B2BDT in PowerCenter environments.

Installation of B2BDT for a Standalone Environment


When using B2BDT services in a standalone environment, it is expected that one of the invocation methods (e.g., web services, .NET, Java APIs, command line or CGI) will be used to invoke B2BDT services. Consult the accompanying B2BDT documentation for use in these environments.

Non-PowerCenter Middleware Platform Integration


Be sure to plan for additional agents to be installed. Refer to the appropriate installation guide or contact Informatica support for details on installing B2BDT in environments other than PowerCenter.

Other Decision Points


Where will the B2BDT service repository be located? The choices for the location of the service repository are (i) a path on the local file system or (ii) a shared network drive. The justification for using a shared network drive is typically to simplify service deployment when two separate B2BDT servers need to share the same repository. While a shared repository is convenient for a multi-server production environment, it is not advisable for development, as multiple development teams could potentially overwrite the same project files. When a repository is shared between multiple machines and a service is deployed via B2BDT Studio, the Service Refresh Interval setting controls how quickly other running installations of B2BDT detect the deployment of the service.

What are the multi-user considerations? If multiple users share a machine (but not at the same time), the environment variable IFConfigLocation4 can be used to point each user to a different configuration file.
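As a minimal sketch of that per-user arrangement, each login can be pointed at its own copy of the configuration file. The directory under $HOME is an arbitrary choice and the install path is an assumption, not a product default:

```shell
#!/bin/sh
# Assumed install location; adjust to the actual one.
INSTALL_DIR=/opt/Informatica/ComplexDataExchange

# Give this login its own copy of the configuration file...
mkdir -p "$HOME/b2bdt"
cp "$INSTALL_DIR/CMConfig.xml" "$HOME/b2bdt/CMConfig.xml" 2>/dev/null \
    || echo "default CMConfig.xml not found (B2BDT not installed here?)"

# ...and point B2BDT at it for this user only.
IFConfigLocation4="$HOME/b2bdt/CMConfig.xml"
export IFConfigLocation4
echo "IFConfigLocation4=$IFConfigLocation4"
```

Placed in the user's profile, this takes effect on each log in.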

Security Considerations
As the B2BDT repository, workspace and logging locations are directory-based, all directories to be used should be granted read and write permissions for the user identity under which the B2BDT service will run. The identity associated with the caller of the B2BDT services will also need permission to execute the files installed in the B2BDT binary directory. Special consideration should be given to environments such as web services, where the user identity under which the B2BDT service runs may be different from the interactive user or the user associated with the calling application.
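The grants described above could be scripted along the following lines. The service account name b2bdtsvc and the directory names are assumptions, and the function only prints the commands so they can be reviewed before being run with the necessary privileges:

```shell
#!/bin/sh
# grant_b2bdt_perms <install_dir> <service_user>: print the chown/chmod
# commands that would grant the B2BDT service account the required access.
grant_b2bdt_perms() {
    install_dir=$1; svc_user=$2
    for dir in ServiceDB CMReports; do   # repository and log locations
        echo "chown -R $svc_user $install_dir/$dir"
        echo "chmod -R u+rw $install_dir/$dir"
    done
    # the caller's identity must be able to execute the installed binaries
    echo "chmod -R u+rx $install_dir/bin"
}

grant_b2bdt_perms /opt/Informatica/ComplexDataExchange b2bdtsvc
```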

Log File and Tracing Locations


Log files and tracing options should be configured for appropriate recycling policies. The calling application must have permissions to read, write and delete files to the path that is set for storing these files.

B2BDT Pre-install Checklist


B2BDT has client and server components. Only the server (or engine) component is installed on UNIX platforms. The client or development studio is only supported on the Windows platform. Reviewing the environment and recording the information in a detailed checklist facilitates the B2BDT install.

Minimum System Requirements


Verify that the minimum requirements for Operating System, Disk Space, Processor Speed and RAM are met and record them in the checklist. Verify the following:

- B2BDT requires a Sun Java 2 Runtime Environment (version 1.5.x or above). B2BDT bundles the appropriate JRE version; alternatively, the installer can be pointed to an existing JRE, or a JRE can be downloaded from Sun.
  - If the server platform is AIX, Solaris or Linux, JRE version 1.5 or higher is installed and configured.
  - If the server platform is HP-UX, JRE version 1.5 or higher and the Java -AA add-on are installed and configured.
- A login account and directory have been created for the installation.
- Confirm that the profile file is not write-protected. The setup program needs to update the profile:
  - ~/.profile if you use the sh, ksh, or bash shell
  - ~/.cshrc or ~/.tcshrc if you use the csh or tcsh shell
- 500MB or more of temporary workspace is available.
- Data and stack size:
  - If the server platform is Linux, the data and stack sizes are not limited.
  - If the server platform is AIX, the data size is not limited.
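The scriptable items in the checklist can be verified with a short shell sketch. The helper names and PASS/FAIL labels are ours, and the 512000 KB threshold corresponds to the 500MB workspace requirement:

```shell
#!/bin/sh
# Sketch of automated pre-install checks; thresholds mirror the checklist.

check() {  # check <label> <command...>: print PASS or FAIL for the command
    label=$1; shift
    if "$@" >/dev/null 2>&1; then echo "PASS: $label"; else echo "FAIL: $label"; fi
}

data_unlimited() { [ "$(ulimit -d 2>/dev/null)" = "unlimited" ]; }
temp_space_ok() {
    avail=$(df -kP /tmp | awk 'NR==2 {print $4}')
    [ "${avail:-0}" -ge 512000 ]   # 512000 KB = 500MB
}

check "JRE on PATH"                 command -v java
check "data size unlimited"         data_unlimited
check "profile not write-protected" test -w "$HOME/.profile"
check "500MB temporary workspace"   temp_space_ok
```

Any FAIL line points back at the corresponding checklist item above.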

PowerCenter Integration Requirements


Complete a separate checklist for integration if you plan to integrate B2BDT with PowerCenter. For an existing PowerCenter installation, the B2BDT client will need to be installed on at least one PC in which the PowerCenter client resides. Also, B2BDT components will need to be installed on the PowerCenter server. If utilizing an existing PowerCenter installation ensure the following:
- Which version of PowerCenter is being used (8.x required)?
- Is the PowerCenter version 32-bit or 64-bit?
- Are the PowerCenter client tools installed on the client PC?
- Is the PowerCenter server installed on the server?

For new PowerCenter installations, the PowerCenter Pre-Install Checklist needs to be completed. Keep in mind that the same hardware will be utilized for both PowerCenter and B2BDT.

Non-PowerCenter Integration requirements


In addition to the general B2BDT requirements, non-PowerCenter agents require that additional components are installed:

- B2BDT Agent for BizTalk: requires that Microsoft BizTalk Server (version 2004 or 2006) is installed on the same computer as B2BDT. If B2BDT Studio is installed on the same computer as BizTalk Server 2004, the Microsoft SP2 service pack for BizTalk Server must be installed.
- B2BDT Translator for Oracle BPEL: requires that BPEL 10.1.2 or above is installed.
- B2BDT Agent for WebMethods: requires that WebMethods 6.5 or above is installed.
- B2BDT Agent for WebSphere Business Integration Message Broker: requires that WBIMB 5.0 with CSD06 (or WBIMB 6.0) is installed. Also ensure that the platform supports both the B2BDT Engine and WBIMB.

A valid license key is needed to run a B2BDT project and must be installed before B2BDT services will run on the computer. Contact Informatica support to obtain a B2BDT license file (B2BDTLicense.cfg). B2BDT Studio can be used without installing a license file.


B2BDT Installation and Configuration


The B2BDT installation process involves two main components - the B2BDT development workbench (Studio) and the B2BDT Server, which is an application deployed on a server. The installation tips apply to UNIX environments. This section should be used as a supplement to the B2B Data Transformation Installation Guide. Before installing B2BDT, complete the following steps:
- Verify that the hardware meets the minimum system requirements for B2BDT, and that the combination of hardware and operating system is supported by B2BDT.
- Ensure that sufficient space has been allocated to the B2BDT ServiceDB.
- Apply all necessary patches to the operating system.
- Ensure that the B2BDT license file has been obtained from technical support.
- Be sure to have administrative privileges for the installation user id. On *nix systems, ensure that read, write and execute privileges have been granted on the installation directory.

Adhere to the following sequence of steps to successfully install B2BDT:

1. Complete the B2BDT pre-install checklist and obtain valid license keys.
2. Install the B2BDT development workbench (Studio) on the Windows platform.
3. Install the B2BDT server on a server machine. When used in conjunction with PowerCenter, the server component must be installed on the same physical machine where PowerCenter resides.
4. Install the necessary client agents when used in conjunction with WebSphere, WebMethods or BizTalk.

B2BDT Install Components


- B2B Data Transformation Studio
- B2B Data Transformation Engine
- Processors
- Optional agents
- Optional libraries

Each component is described below:

Engine (both UNIX and Windows): The runtime module that executes B2BDT data transformations. This module is required in all B2BDT installations.

Studio (Windows only): The design and configuration environment for creating and deploying data transformations. B2BDT Studio is hosted within Eclipse on Windows platforms. The Eclipse setup is included in the B2BDT installation package.

Document Processors (both UNIX and Windows): A set of components that perform global processing operations on documents, such as transforming their file formats. All the document processors run on Windows platforms, and most of them run on UNIX-type platforms.

Libraries (Windows only; see description): Libraries of predefined B2BDT data transformations, which can be used with industry messaging standards such as EDI, ACORD, HL7, HIPAA, and SWIFT. Each library contains parsers, serializers, and XSD schemas for the appropriate messaging standard. The libraries can be installed on Windows platforms. B2BDT Studio can be used to import the library components into projects and deploy the projects to Windows or UNIX-type platforms.

Documentation (Windows only): An online help library containing all the B2BDT documentation. A PDF version of the documentation is available for UNIX platforms.

Install the B2BDT Engine

Step 1:


Run the UNIX installation file from the software folder on the installation CD and follow the wizard prompts to complete the install.

TIP: A language must be selected during the installation. If there are plans to change the language later in the Configuration Editor, Informatica recommends choosing a non-English language for the initial setup. If English is selected and later changed to another language, some of the services required for other languages might not be installed.

B2BDT supports all of the major UNIX-type systems (e.g., Sun Solaris, IBM AIX, Linux and HP-UX). On UNIX-type operating systems, the installed components are the B2BDT Engine and the document processors.

Note: On UNIX-type operating systems, do not limit the data size and the stack size. To determine whether there is currently a limitation, run the following command:

For AIX, HP, and Solaris: ulimit -a
For Linux: limit

If very large documents are processed using B2BDT, try adjusting system parameters such as the memory size and the file size. Two install modes are possible under UNIX: Graphical Interface and Console Mode. The default installation path is /opt/Informatica/ComplexDataExchange.

The default Service Repository Path is <INSTALL_DIR>/ServiceDB. This is the storage location for data transformations that are deployed as B2BDT services. The default log path is <INSTALL_DIR>/CMReports; this is the location where the B2BDT Engine stores its log files, and it is also known as the reports path. The repository location, JRE path and log path can be changed after the installation using environment variables.

Step 2:
Install the license file. Verify the validity of the license file with the following command:

CM_console -v

The system displays information such as the location and validity of the license file (sample output shown below):

$ ./bin/CM_console -v
Version: 4.4.0 (Build:186)
Syntax version: 4.00.10
Components: Engine Processors
Configuration file: /websrvr/informatica/ComplexDataExchange/CMConfig.xml
Package identifier: IF_AIX_OS64_pSeries_C64
License information:
License-file path: /websrvr/informatica/ComplexDataExchange/CDELicense.cfg
Expiration date: 21/02/08 (dd/mm/yyyy)
Maximum CPUs: 1
Maximum services: 1
Licensed components: Excel,Pdf,Word,Afp,Ppt

Step 3:
Load the Environment Variables. When the setup is complete, configure the system to load the B2BDT environment variables. The B2BDT setup assigns several environment variables that point to the installation directory and to other locations that the system needs. On UNIX-type platforms, the system must be configured to load these variables; B2BDT cannot run until this is done. B2BDT setup creates an environment variables file, which can be loaded in either of the following ways:

Manually from the command line. In lieu of loading the environment variables automatically, they can be loaded manually from the command line. This must be done upon each log in, before using B2BDT.

For the sh, ksh, or bash shell, the command is:
. /<INSTALL_DIR>/setEnv.sh
For the csh or tcsh shell, the command is:
source /<INSTALL_DIR>/setEnv.csh

Substitute the installation path for <INSTALL_DIR> as necessary.

Automatically, by inserting the appropriate command in the profile or in a script file. To configure the system to load the environment variables file automatically upon log in:

For the sh, ksh, or bash shell, insert the following line in the profile file:
. /<INSTALL_DIR>/setEnv.sh
For the csh or tcsh shell, insert the following line in the login file:
source /<INSTALL_DIR>/setEnv.csh

On UNIX-type platforms, B2BDT uses the following environment variables:

PATH (required): The environment variables file adds <INSTALL_DIR>/bin to the path. Note: In rare instances, the B2BDT Java document processors require that the JRE be added to the path.

Library path (required; LIBPATH on AIX, LD_LIBRARY_PATH on Solaris and Linux, SHLIB_PATH and LD_LIBRARY_PATH on HP-UX): The environment variables file adds the installation directory (<INSTALL_DIR>) to the library path. It also adds the JVM directory of the JRE and its parent directory to the path, for example, <INSTALL_DIR>/jre1.4/lib/sparc/server and <INSTALL_DIR>/jre1.4/lib/sparc. This value can be edited to use another compatible JRE.

IFCONTENTMASTER_HOME (required): The environment variables file creates this variable, which points to the B2BDT installation directory (<INSTALL_DIR>).

CLASSPATH (required): The environment variables file adds <INSTALL_DIR>/api/lib/CM_JavaAPI.jar to the Java class path.

IFConfigLocation4 (optional): The path of the B2BDT configuration file; this allows multiple configurations.

The following is an example of an environment variables file (setEnv.csh) on an AIX system. The variable names and values differ slightly on other UNIX-type operating systems.

## B2B Data Transformation Environment settings
setenv IFCMPath /opt/Informatica/ComplexDataExchange
setenv CMJAVA_PATH /opt/Informatica/ComplexDataExchange/jre1.4/jre/bin/classic:/opt/Informatica/ComplexDataExchange/jre1.4/jre/bin

# Prepend B2B Data Transformation to the PATH
if ( ! $?PATH ) then
    setenv PATH ""
endif
setenv PATH "${IFCMPath}/bin:${PATH}"

# Add CM & java path to LIBPATH
if ( ! $?LIBPATH ) then
    setenv LIBPATH ""
endif
setenv LIBPATH "${IFCMPath}/bin:${CMJAVA_PATH}:${LIBPATH}"

# Update IFCONTENTMASTER_HOME
setenv IFCONTENTMASTER_HOME "${IFCMPath}"

# Prepend CM path to CLASSPATH
if ( ! $?CLASSPATH ) then
    setenv CLASSPATH ""
endif
setenv CLASSPATH "${IFCMPath}/api/lib/CM_JavaAPI.jar:.:${CLASSPATH}"
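Once the environment variables file has been sourced, the result can be sanity-checked from an sh-family shell. A sketch (the helper name check_env is ours, not part of the product):

```shell
#!/bin/sh
# check_env VAR...: report whether each named variable is set.
check_env() {
    rc=0
    for var in "$@"; do
        eval "val=\$$var"
        if [ -n "$val" ]; then
            echo "OK: $var"
        else
            echo "MISSING: $var"; rc=1
        fi
    done
    return $rc
}

# Variables that setEnv.sh should have exported (IFConfigLocation4 is optional)
check_env IFCONTENTMASTER_HOME PATH CLASSPATH \
    || echo "environment not loaded; source setEnv.sh first"
```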

Step 4:
Configuration settings

Directory Locations

During the B2BDT setup, prompts were completed for the directory locations of the B2BDT repository, log files and JRE. If necessary, alter these locations by editing the following parameters:

CM Configuration/Directory services/File system/Base Path: The B2BDT repository location, where B2BDT services are stored.

CM Configuration/CM Engine/JVM Location: On UNIX, this parameter is not available in the Configuration Editor. For more information about setting the JRE on UNIX, see the UNIX Environment Variable Reference.

CM Configuration/General/Reports directory: The log path, also called the reports path, where B2BDT saves event logs and certain other types of reports.

CM Configuration/CM Engine/Invocation and CM Configuration/CM Engine/CM Server: These settings control whether the B2BDT Engine runs in-process or out-of-process.

B2BDT has a Configuration Editor for editing the parameters of a B2BDT installation. To open the Configuration Editor on UNIX in graphical mode, enter the following command:

<INSTALL_DIR>/CMConfig

Note: The Configuration Editor is not supported in UNIX console mode.

Some of the Configuration Editor settings are available for all B2BDT installations; additional settings vary depending on the B2BDT version and on the optional components that have been installed. The Configuration Editor saves the configuration in an XML file. By default, the file is <INSTALL_DIR>/CMConfig.xml.

Note: Before editing the configuration, save a backup copy of CMConfig.xml so that it can be restored in the event of a problem. The file <INSTALL_DIR>/CMConfig.bak is a backup of the original <INSTALL_DIR>/CMConfig.xml, which the setup program created when B2BDT was installed. Restoring CMConfig.bak reverts B2BDT to its original configuration.

OS environment variables are used to set aspects of the system such as the Java classpath, the location of the configuration file for a specific user, the home location of the installed B2BDT instance, library paths, etc. The following list shows some typical configuration items and where they are set:

Memory for Studio: B2BDT Configuration application
JVM / JRE usage: B2BDT Configuration application
Tuning parameters (threads, timeouts, etc.): B2BDT Configuration application
User specific settings: Use an environment variable to point to a different configuration file
Memory for runtime: B2BDT Configuration application
Workspace location: B2BDT Configuration application (B2BDT 4.3), B2BDT Studio (B2BDT 4.4)
Event generation: Set in project properties
Repository location: B2BDT Configuration application

In-Process or Out-of-Process Invocation

Out-of-process invocation requires the use of the B2BDT Server application (which is already installed by the install process). The distinction is that running in server mode can cause transformations to run slower, but errors are isolated from the calling process. For web services, server mode is sometimes recommended, as the lifetime of the host process then becomes independent of the lifetime of the process space allocated to run the web service. For example, IIS can run web services in a mode where a process dies or is recycled after a call to a web service. For B2BDT, the first call after a process startup can take up to 3 seconds (subsequent calls are usually milliseconds), so it is not optimal to start a host process on each invocation. Running in server mode keeps process lifetimes independent.

TIP: B2BDT Studio and the CM_console command always run data transformations in-process.

Running out-of-process has the following advantages:

- Allows 64-bit processes to activate 32-bit versions of the B2BDT Engine.
- An Engine failure is less likely to disrupt the calling application.
- Helps prevent binary collisions with other modules that run in the process of the calling application.

In-process invocation has the following advantage:

- Faster performance than out-of-process.

Thread pool settings



The thread pool controls the maximum number of Engine threads that can run client requests concurrently per process. If the number of client requests exceeds the number of available threads, the Server queues the requests until a thread is available. The default setting is 4. Some recommendations are summarized in the table below. Actual needs vary depending upon requirements. Best practices and additional recommendations are part of Jumpstart and Base Line Architecture engagements. Contact an Informatica representative for additional information.

Step 5:
Configure ODBC connectivity. Note: This step is only needed if the ODBC database support features of B2BDT will be used. In that case, an ODBC driver may need to be configured.

Step 6:
Test the installation to confirm that B2BDT operates properly.

Note: Tests are available to verify the engine and document processor installation. Refer to the directory <INSTALL_DIR>/setupTests for the B2BDT test projects testCME and testCMDP. Sample output would be similar to the following:

cd $IFCONTENTMASTER_HOME
cp -R setupTests/testCME ServiceDB/
CM_console testCME
<Result>Test Succeeded</Result>
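The manual steps above can be wrapped in a small script. This is a sketch only: it assumes setEnv.sh has been sourced and CM_console is on the PATH, and the wrapper name run_smoke_test is ours:

```shell
#!/bin/sh
run_smoke_test() {
    if [ -z "$IFCONTENTMASTER_HOME" ] || [ ! -d "$IFCONTENTMASTER_HOME" ]; then
        echo "SKIP: environment not loaded (source setEnv.sh first)"
        return 0
    fi
    cd "$IFCONTENTMASTER_HOME" || return 1
    cp -R setupTests/testCME ServiceDB/
    # CM_console prints <Result>Test Succeeded</Result> on success
    if CM_console testCME | grep -q "<Result>Test Succeeded</Result>"; then
        echo "engine test OK"
    else
        echo "engine test FAILED"
    fi
}
run_smoke_test
```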

B2BDT Integration with PowerCenter


B2BDT does support using the runtime as a server process to be invoked from PowerCenter on the same physical machine (in addition to offering the ability to invoke the B2BDT runtime engine in-process with the calling environment). While this does constitute a server process in developer terminology, it does not provide full server administration or monitoring capabilities that are typical of enterprise server products. It is the responsibility of the calling environment to provide these features. Part of the overall solution architecture is to define how these facilities are mapped to a specific B2BDT implementation.


Installation of B2BDT for PowerCenter is a straightforward process. All required plugins needed to develop and run B2BDT transformations are installed as part of a PowerCenter installation. However, certain B2BDT versions and plugins need to be registered after the install process. Refer to the PowerCenter Installation Guide for details.

Note: A PowerCenter UDO Transformation can only be created if the UDO plug-in is successfully registered in the repository. If the UDO Option is installed correctly then UDO Transformations can be created in the PowerCenter Designer. Note: ODBC drivers provided for PowerCenter are not automatically usable from B2BDT as licensing terms prohibit this in some cases. Contact an Informatica support representative for further details.

Additional Details for PowerCenter Integration

- B2BDT is a Custom Transformation object within PowerCenter.
- PowerCenter passes data via memory buffers to the B2BDT engine and retrieves the output via buffers.
- The B2BDT engine runs in-process with the PowerCenter engine.
- The Custom Transformation object can be dragged and dropped inside a PowerCenter mapping.

When using a B2BDT transformation, PowerCenter does NOT process the input files directly; instead it takes a path and filename from a text file. The engine then processes the data through the B2BDT parser defined within the mapping, after which the data is returned to the PowerCenter B2BDT transformation for processing by other Informatica transformation objects.

TIP: Verify that the source filename is the name of the text file in which both the file path and the file name are present. It cannot be the actual file being parsed by PowerCenter and B2BDT. This is the direct versus indirect sourcing of the file.
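To illustrate the indirect sourcing just described: the file handed to the PowerCenter session contains only the path of the document to be parsed, not the document itself. All file names below are made up for the example:

```shell
#!/bin/sh
# Create a sample document and the indirect "filelist" that PowerCenter reads.
mkdir -p /tmp/b2bdt_demo
echo "sample EDI payload" > /tmp/b2bdt_demo/invoice_001.dat

# The session's source file lists the real document's full path:
echo "/tmp/b2bdt_demo/invoice_001.dat" > /tmp/b2bdt_demo/filelist.txt
cat /tmp/b2bdt_demo/filelist.txt
```

The session's Source filename would point at filelist.txt; the B2BDT parser then opens invoice_001.dat.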

Useful Tips and Tricks


Version Compatibility?
- Ensure that the version of B2BDT is compatible with PowerCenter; otherwise issues can manifest in many different forms.
- In general, B2BDT 4.4 is compatible with PowerCenter 8.5 and with 8.1.1 SP4 (and SP4 only); B2BDT 4.0.6 is compatible with 8.1.1; and B2BDT 3.2 is compatible with PowerCenter 7.x. For more information refer to the Product Availability Matrix.

Service Deployment?
- Ensure that services are deployed on the remote machine where PowerCenter is installed.
- Services deployed from Studio appear in the PowerCenter Designer B2BDT transformation as a dropdown list.


Note: The services listed are only the ones deployed on the local machine. If any services are deployed on remote machines, the Designer will not display them. Because it is easy to mistake local services for remote ones, manually ensure that the services on the local and remote machines are in sync.

After making certain that the services are deployed on the remote server that also has PowerCenter installed, the B2BDT transformation can be specified from the UDO (8.1.1) or B2BDT (8.5) transformation on the Metadata Extensions tab.

Last updated: 02-Jun-08 16:23


B2B Data Transformation Installation (for Windows)

Challenge


Installing and configuring B2B Data Transformation (B2BDT) on new or existing hardware, either in conjunction with PowerCenter or co-existing with other host applications on the same server. Note: B2B Data Transformation was formerly called Complex Data Exchange (CDE). Any references to PowerExchange Complex Data Exchange in this document are now referred to as B2B Data Transformation (B2BDT).

Description
Consider the following questions when determining what type of hardware to use for B2BDT:

If the hardware already exists:
1. Are the processor and operating system supported by B2BDT?
2. Are the necessary operating system patches applied?
3. How many CPUs does the machine currently have? Can the CPU capacity be expanded?
4. How much memory does the machine have? How much is available to the B2BDT application?
5. Will B2BDT share the machine with other applications? If yes, what are the CPU and memory requirements of the other applications?

If the hardware does not already exist:
1. Has the organization standardized on a hardware or operating system vendor?
2. What type of operating system is preferred and supported?

Regardless of the hardware vendor chosen, the hardware must be configured and sized appropriately to support the complex data transformation requirements for B2BDT. Among other factors, the hardware requirements for the B2BDT environment depend upon the data volumes, the number of concurrent users, and the application server and operating system used. For exact sizing recommendations, contact Informatica Professional Services for a B2BDT Sizing and Baseline Architecture engagement.

Planning for the B2BDT Installation


There are several variations of the hosting environment from which B2BDT services will be invoked. This has implications for how B2BDT is installed and configured.

Host Software Environment


The most common configurations are:
1. B2BDT used in conjunction with PowerCenter
2. B2BDT as a standalone configuration
3. B2BDT in conjunction with a non-PowerCenter integration, using an adapter for other middleware software such as WebMethods or Oracle BPEL

B2BDT 4.4 includes a mechanism for exposing B2BDT services through web services so that they can be called from applications capable of calling web services. Depending on what host options are chosen, installation options may vary.

Installation of B2BDT for a PowerCenter Host Environment


Be sure to have the necessary licenses and the additional plug-in required for PowerCenter integration. Refer to the appropriate installation guide or contact Informatica support for details on the installation of B2BDT in PowerCenter environments.

Installation of B2BDT for a Standalone Environment


When using B2BDT services in a standalone environment, it is expected that one of the invocation methods (e.g., Web Services, .Net, Java APIs, Command Line or CGI) will be used to invoke B2BDT services. Consult accompanying B2BDT documentation for use in these environments.

Non-PowerCenter Middleware Platform Integration


Be sure to plan for the additional agents to be installed. Refer to the appropriate installation guide or contact Informatica support for details on installing B2BDT in environments other than PowerCenter.

Other Decision Points


Where will the B2BDT service repository be located? The choices for the location of the service repository are i) a path on the local file system or ii) a shared network drive. The justification for using a shared network drive is typically to simplify service deployment when two separate B2BDT servers share the same repository. While a shared repository is convenient for a multi-server production environment, it is not advisable for development, as multiple development teams could potentially overwrite the same project files. When a repository is shared between multiple machines and a service is deployed via B2BDT Studio, the Service Refresh Interval setting controls how quickly other running installations of B2BDT detect the deployment of the service.

What are the multi-user considerations? If multiple users share a machine (but not at the same time), the environment variable IFConfigLocation4 can be used to point each user to a different configuration file.

Security Considerations
As the B2BDT repository, workspace and logging locations are directory-based, all directories to be used should be granted read and write permissions for the user identity under which the B2BDT service will run. The identity associated with the caller of the B2BDT services will also need permission to execute the files installed in the B2BDT binary directory. Special consideration should be given to environments such as web services, where the user identity under which the B2BDT service runs may differ from the interactive user or the user associated with the calling application.
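The permission setup described above can be sketched as follows for a UNIX-type platform. This is illustrative only: the install root is simulated with a temporary directory (a real install would use e.g. /opt/Informatica/ComplexDataExchange), and the "workspace" directory name is an assumption; ServiceDB and CMReports are the documented defaults.

```shell
# Sketch: grant read/write on the directory-based B2BDT locations for the
# identity that runs the service. B2BDT_HOME is a temp-dir stand-in for the
# real installation directory.
B2BDT_HOME=$(mktemp -d)

for dir in "$B2BDT_HOME/ServiceDB" "$B2BDT_HOME/CMReports" "$B2BDT_HOME/workspace"; do
  mkdir -p "$dir"        # repository, log and workspace locations
  chmod u+rwx "$dir"     # read/write/traverse for the B2BDT service identity
done

# Callers of B2BDT services also need execute permission on the binaries.
mkdir -p "$B2BDT_HOME/bin"
chmod -R a+rx "$B2BDT_HOME/bin"
```

In production the same commands would be run (as an administrator) against the real install directories, with the group or ACL matching the service account.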

Log File and Tracing Locations


Log files and tracing options should be configured with appropriate recycling policies. The calling application must have permissions to read, write and delete files in the path that is set for storing these files.

B2BDT Pre-install Checklist


It is best to review the environment and record the information in a detailed checklist to facilitate the B2BDT install.

Minimum System Requirements


Verify that the minimum requirements for the Operating System, Disk Space, Processor Speed and RAM are met and record them in the checklist.

- B2BDT Studio requires Microsoft .NET Framework, version 2.0. If this version is not already installed, the installer will prompt for and install the framework automatically.
- B2BDT requires a Sun Java 2 Runtime Environment, version 1.5.x or above.
- B2BDT bundles the appropriate JRE version. The installer can be pointed to an existing JRE, or a JRE can be downloaded from Sun.
- To install the optional B2BDT libraries, reserve additional space (refer to the documentation for additional information).

PowerCenter Integration Requirements


Complete the checklist for integration if B2BDT will be integrated with PowerCenter. For an existing PowerCenter installation, the B2BDT client needs to be installed on at least one PC on which the PowerCenter client resides. B2BDT components also need to be installed on the PowerCenter server. If utilizing an existing PowerCenter installation, ensure the following:

- Which version of PowerCenter is being used (8.x required)?
- Is the PowerCenter version 32-bit or 64-bit?
- Are the PowerCenter client tools installed on the client PC?
- Is the PowerCenter server installed on the server?

For new PowerCenter installations, the PowerCenter Pre-Install Checklist should be completed. Keep in mind that the same hardware will be utilized for PowerCenter and B2BDT. For Windows Server, verify the following:

- The login account used for the installation has local administrator rights.
- 500MB or more of temporary workspace is available.
- The Java 2 Runtime Environment version 1.5 or higher is installed and configured.
- Microsoft .NET Framework, version 2.0 is installed.

Non-PowerCenter Integration Requirements


In addition to the general B2BDT requirements, the non-PowerCenter agents require that additional components are installed:

- B2BDT Agent for BizTalk requires that Microsoft BizTalk Server (version 2004 or 2006) is installed on the same computer as B2BDT. If B2BDT Studio is installed on the same computer as BizTalk Server 2004, the Microsoft SP2 service pack for BizTalk Server must be installed.
- B2BDT Translator for Oracle BPEL requires that BPEL 10.1.2 or above is installed.
- B2BDT Agent for WebMethods requires that WebMethods 6.5 or above is installed.
- B2BDT Agent for WebSphere Business Integration Message Broker requires that WBIMB 5.0 with CSD06 (or WBIMB 6.0) is installed. Also ensure that the Windows platform supports both the B2BDT Engine and WBIMB.

A valid license key is needed to run a B2BDT project and must be installed before B2BDT services will run on the computer. Contact Informatica support to obtain a B2BDT license file (B2BDTLicense.cfg). B2BDT Studio can be used without installing a license file.

B2BDT Installation and Configuration


The B2BDT installation consists of two main components - the B2BDT development workbench (Studio) and the B2BDT Server (which is an application deployed on a server). The installation tips apply to Windows environments. This section should be used as a supplement to the B2B Data Transformation Installation Guide. Before installing B2BDT complete the following steps:
- Verify that the hardware meets the minimum system requirements for B2BDT.
- Ensure that the combination of hardware and operating system is supported by B2BDT.
- Ensure that sufficient space has been allocated for the B2BDT ServiceDB.
- Ensure that all necessary patches have been applied to the operating system.
- Ensure that the B2BDT license file has been obtained from technical support.
- Be sure to have administrative privileges for the installation user ID.

Adhere to the following sequence of steps to successfully install B2BDT:

1. Complete the B2BDT pre-install checklist and obtain valid license keys.
2. Install the B2BDT development workbench (Studio) on the Windows platform.
3. Install the B2BDT server on a server machine. When used in conjunction with PowerCenter, the server component must be installed on the same physical machine where PowerCenter resides.
4. Install the necessary client agents when used in conjunction with WebSphere, WebMethods or BizTalk.

In addition to the standard B2BDT components that are installed by default, additional libraries can be installed. Refer to the B2BDT documentation for detailed information on these library components.

B2BDT Install Components


The install package includes the following components:

- B2B Data Transformation Studio
- B2B Data Transformation Engine
- Document Processors
- Documentation
- Optional agents
- Optional libraries

The table below provides descriptions of each component:

Engine: The runtime module that executes B2BDT data transformations. This module is required in all B2BDT installations.

Studio: The design and configuration environment for creating and deploying data transformations. B2BDT Studio is hosted within Eclipse on Windows platforms. The Eclipse setup is included in the B2BDT installation package.

Document Processors: A set of components that perform global processing operations on documents, such as transforming their file formats. All the document processors run on Windows platforms, and most of them run on UNIX-type platforms.

Optional Libraries: Libraries of predefined B2BDT data transformations, which can be used with industry messaging standards such as EDI, ACORD, HL7, HIPAA, and SWIFT. Each library contains parsers, serializers, and XSD schemas for the appropriate messaging standard. The libraries can be installed on Windows platforms. B2BDT Studio can be used to import the library components to projects in order to deploy the projects to Windows or UNIX-type platforms.

Documentation: An online help library, containing all the B2BDT documentation.

Install the B2BDT Studio and Engine


Step 1:
Run the Windows installation file from the software folder on the installation CD and follow the wizard prompts to complete the install.


TIP: During the installation a language must be selected. If there are plans to change the language at a later point in time in the Configuration Editor, Informatica recommends that a non-English language is chosen for the initial setup. If English is selected and later changed to another language, some of the services that are required for other languages might not be installed.

- The default installation path is C:\Informatica\ComplexDataExchange.
- The default Service Repository Path is <INSTALL_DIR>/ServiceDB. This is the storage location for data transformations that are deployed as B2BDT services.
- The default Log path is <INSTALL_DIR>/CMReports. The Log path is the location where the B2BDT Engine stores its log files; it is also known as the reports path.


The repository location, JRE path and Log path can be changed subsequent to the installation using environment variables.


Step 2:
Install the license file. Verify the validity of the license file with the following command:

CM_console v

The system displays information such as the location and validity of the license file.

Step 3:
Configure the Environment Variables. The B2BDT setup assigns several environment variables, which point to the installation directory and to other locations that the system needs. On Windows, the B2BDT setup creates or modifies the following environment variables:

- PATH (Required): The setup adds <INSTALL_DIR>/bin to the path. Note: In rare instances, the B2BDT Java document processors require that the JRE be added to the path.
- CLASSPATH (Required): The setup adds <INSTALL_DIR>\api\lib\CM_JavaAPI.jar to the path; the environment variables file adds <INSTALL_DIR>/api/lib/CM_JavaAPI.jar to the Java class path.
- IFCONTENTMASTER_HOME (Required): The setup creates this environment variable, which points to the B2BDT installation directory (<INSTALL_DIR>).
- IFConfigLocation4 (Optional): The path of the B2BDT configuration file.
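On UNIX-type platforms, equivalent variables are typically exported from a shell profile or the B2BDT environment variables file. A minimal sketch follows; the install root is simulated here with a temporary directory, so substitute the real <INSTALL_DIR> in practice.

```shell
# Sketch: UNIX-style equivalents of the environment variables above.
# INSTALL_DIR is a temp-dir stand-in for the real installation directory.
INSTALL_DIR=$(mktemp -d)
mkdir -p "$INSTALL_DIR/bin" "$INSTALL_DIR/api/lib"

export IFCONTENTMASTER_HOME="$INSTALL_DIR"                      # install root
export PATH="$INSTALL_DIR/bin:$PATH"                            # engine binaries
export CLASSPATH="$INSTALL_DIR/api/lib/CM_JavaAPI.jar:$CLASSPATH"  # Java API jar
# Optional per-user configuration file location:
# export IFConfigLocation4="$HOME/CMConfig.xml"
```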

Step 4:
Configuration settings. The configuration application allows for the setting of properties such as JVM parameters, thread pool settings, memory available to the Studio environment and many others. Consult the administrator's guide for a full list of settings and their effects. Properties set using the B2BDT configuration application affect both the operation of the standalone B2BDT runtime environment and the behavior of the B2BDT Studio environment. To open the Configuration Editor in Windows, from the Start menu choose Informatica > B2BDT > Configuration.


Some of the Configuration Editor settings are available for all B2BDT installations. Additional settings vary depending on the B2BDT version and the optional components installed. The Configuration Editor saves the configuration in an XML file. By default, the file is <INSTALL_DIR>/CMConfig.xml.

The B2BDT Studio environment should be installed on each developer's machine or environment. While advances in virtualization and technologies such as Windows remote desktop connections theoretically allow multiple users to share the same B2BDT installation, the B2BDT Studio environment does not implement mechanisms such as file locking during the authoring of transformations, which would be needed to prevent multiple users from overwriting each other's work.

An environment variable called IFConfigLocation4 can be defined. The value of the variable must be the path of a valid configuration file (i.e., c:\MyIFConfigLocation4\CMConfig1.xml). For example, if two users want to run B2BDT Engine with different configurations on the same platform, store their respective configuration files in their home directories. Both files must have the name CMConfig.xml. Alternatively, store a CMConfig.xml file in the home directory of one of the users; the other user will use the default configuration file (e.g., <INSTALL_DIR>/CMConfig.xml).

TIP: Always save a backup copy of CMConfig.xml prior to editing. In the event of a problem the last known backup can be restored. The file <INSTALL_DIR>/CMConfig.bak is a backup of the original <INSTALL_DIR>/CMConfig.xml which the setup program created when B2BDT was installed. Restoring CMConfig.bak reverts B2BDT to its original configuration.
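The backup-before-edit tip and the per-user IFConfigLocation4 configuration can be sketched together as follows. Both directories are simulated with temporary locations, and the `<CMConfiguration/>` content is a placeholder, not the real file schema.

```shell
# Sketch: back up CMConfig.xml before editing, then give a user a private
# configuration file via the IFConfigLocation4 environment variable.
INSTALL_DIR=$(mktemp -d)      # stand-in for the real <INSTALL_DIR>
USER_CFG_DIR=$(mktemp -d)     # stand-in for the user's config area

printf '<CMConfiguration/>\n' > "$INSTALL_DIR/CMConfig.xml"   # placeholder content

# 1. Keep a backup prior to any edit; restore it if an edit breaks the engine.
cp "$INSTALL_DIR/CMConfig.xml" "$INSTALL_DIR/CMConfig.xml.bak"

# 2. Copy the configuration for this user and point the engine at it.
cp "$INSTALL_DIR/CMConfig.xml" "$USER_CFG_DIR/CMConfig.xml"
export IFConfigLocation4="$USER_CFG_DIR/CMConfig.xml"
```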

OS environment variables are used to set aspects of the system such as the Java classpath to be used, location of the configuration file for a specific user, home location for the installed B2BDT instance to be used, library paths, etc.

The following table lists some typical configuration items and where they are set:

- Memory for Studio: B2BDT Configuration application
- JVM / JRE usage: B2BDT Configuration application
- Tuning parameters (threads, timeouts, etc.): B2BDT Configuration application
- User-specific settings: Use an environment variable to point to a different configuration file
- Memory for runtime: B2BDT Configuration application
- Workspace location: B2BDT Configuration application (B2BDT 4.3), B2BDT Studio (B2BDT 4.4)
- Event generation: Set in project properties
- Repository location: B2BDT Configuration application
In-Process or Out-of-Process Invocation

Out-of-process invocation requires the use of the B2BDT Server application (which is already installed by the install process). The distinction is that running in server mode causes transformations to potentially run slower, but errors will be isolated from the calling process. For web services, the use of server mode is sometimes recommended because the lifetime of the host process then becomes independent of the lifetime of the process space allocated to run the web service. For example, IIS can run web services in a mode where a process dies or is recycled after a call to the web service. For B2BDT, the first call after a process startup can take up to 3 seconds (subsequent calls are usually milliseconds), hence it is not optimal to start a host process on each invocation. Running in server mode keeps process lifetimes independent.

TIP: B2BDT Studio, and the CM_console command, always run data transformations in-process.

Running out-of-process has the following advantages:

- Allows 64-bit processes to activate 32-bit versions of B2BDT Engine.
- An Engine failure is less likely to disrupt the calling application.
- Helps prevent binary collisions with other modules that run in the process of the calling application.

In-process invocation has the following advantage:

- Faster performance than out-of-process.

Thread pool settings

The thread pool controls the maximum number of Engine threads that can run client requests concurrently per process. If the number of client requests exceeds the number of available threads, the Server queues the requests until a thread is available. The default setting is 4.

Some recommendations are summarized below. Actual needs vary depending upon requirements. Best practices and additional recommendations are part of Jumpstart and Base Line Architecture engagements. Contact an Informatica representative for additional information.

- Eclipse settings (memory available to Studio): By default, Eclipse allocates up to 256MB to the Java VM. Set -vmargs -Xmx512M to allocate 512MB.
- Log file locations: Location security needs to match the identity of the B2BDT engine.
- ServiceDB: Read permissions are needed for the ServiceDB locations.
- Preprocessor buffer sizes: Change if running out of memory during source file processing.
- Service Refresh Interval: Controls how quickly running installations detect newly deployed services.

Step 5:
Configure ODBC connectivity. Note: this step is only needed if the ODBC database support features of B2BDT will be used. In that case, an ODBC driver may need to be configured.

Step 6:
Test the installation to confirm that B2BDT operates properly. Note: Tests are available to test the engine and document processor installation. Refer to the directory <INSTALL_DIR>\setupTests for the B2BDT test projects testCME and testCMDP.


B2BDT Integration With PowerCenter


B2BDT does support using the runtime as a server process to be invoked from PowerCenter on the same physical machine (in addition to offering the ability to invoke the B2BDT runtime engine in-process with the calling environment). While this does constitute a server process in developer terminology, it does not provide full server administration or monitoring capabilities that are typical of enterprise server products. It is the responsibility of the calling environment to provide these features. Part of the overall solution architecture is to define how these facilities are mapped to a specific B2BDT implementation.


Installation of B2BDT for PowerCenter is a straightforward process. All required plug-ins needed to develop and run B2BDT transformations are installed as part of a PowerCenter installation. However, certain B2BDT versions and B2BDT plug-ins need to be registered after the install process. Refer to the PowerCenter Installation Guide for details. The repository option copies the B2BDT plug-ins to the Plugin directory. Register the B2BDT plug-ins in the PowerCenter repository.

PowerCenter 7.1.x
Register the UDT.xml plug-in in the PowerCenter Repository Server installation Plugin directory. The B2BDT plug-in will appear under the repository in the Repository Server Administration Console.

PowerCenter 8.1.x
Register the pmudt.xml plug-in in the Plugin directory of the PowerCenter Services installation. When the B2BDT plug-in is successfully registered in PowerCenter 8.1 it will appear in the Administration Console as follows:


Note: A PowerCenter UDO Transformation can only be created if the UDO plug-in is successfully registered in the repository. If the UDO Option is installed correctly, then UDO Transformations can be created in the PowerCenter Designer. Note: ODBC drivers provided for PowerCenter are not automatically usable from B2BDT, as licensing terms prohibit this in some cases. Contact an Informatica support representative for further details.

Additional Details for PowerCenter Integration

B2BDT is a Custom Transformation object within PowerCenter:

- INFA passes data via memory buffers to the B2BDT engine and retrieves that output via buffers.
- The B2BDT engine runs IN-PROCESS with the PowerCenter engine.
- The Custom Transformation object for Informatica can be dragged and dropped inside a PowerCenter mapping.


When using the B2BDT transformation, PowerCenter does NOT process the input files directly; instead it takes a path and filename (from a text file). The engine then processes the data through the B2BDT parser defined within the mapping. After this, the data is returned to the PowerCenter B2BDT transformation for processing by other Informatica transformation objects.

TIP: Verify that the Source filename is the name of the text file that contains both the file path and the file name. It cannot be the actual file being parsed by PowerCenter and B2BDT. This is the distinction between direct and indirect sourcing of the file.


Useful Tips and Tricks


Can I use an existing Eclipse install with B2BDT?
- Yes, but make sure it is compatible with the version of the B2BDT installation. Check the product compatibility matrix for additional information. B2BDT can be made to work with a different version of Eclipse; however, this is not guaranteed.

Is there a silent install available for B2BDT on Windows?


- As of B2BDT 4.4 there is no silent install mode, but one is likely to be available in a future release.

Version Compatibility?
- Ensure that the version of B2BDT is compatible with PowerCenter; otherwise many issues can manifest in different forms. In general, B2BDT 4.4 is compatible with PowerCenter 8.5 and with 8.1.1 SP4 (and SP4 only), B2BDT 4.0.6 is compatible with 8.1.1, and B2BDT 3.2 is compatible with PowerCenter 7.x. For more information refer to the Product Availability Matrix.

Service Deployment?
- Ensure that services are deployed on the remote machine where PowerCenter is installed.
- Services deployed from Studio show up in the PowerCenter Designer B2BDT transformation as a dropdown list (see screenshot below).

Note: These services are only the ones deployed on the local machine. If any services are deployed on remote machines, the Designer will not display them. Because it is easy to mistake them for remote services, manually ensure that the services on the local and remote machines are in sync.
- After making certain that the services are deployed on the remote server that also has PowerCenter installed, the B2BDT transformation can be specified from the UDO (8.1.1) or B2BDT (8.5) transformation on the Metadata Extensions tab.

Common Installation Troubleshooting Tips

Problem


Problem Description
The following error occurs when opening B2BDT Studio: "There was a problem running ContentMaster studio. Please make sure /CMConfig.XML is a valid configuration file (Error code=2)"

Solution
To resolve this issue, edit the CMConfig.xml file and add the following section after </CMAgents> and before <CMDocumentProcessors version="4.0.6.61"/>:

<CMStudio version="4.0.6.61">
  <Eclipse>
    <Path>C:/Program Files/Itemfield/ContentMaster4/eclipse</Path>
    <Workspace>C:\Documents and Settings\kjatin.INFORMATICA\My Documents\Itemfield\ContentMaster\4.0\workspace</Workspace>
  </Eclipse>
</CMStudio>

Note: Modify the path names as necessary to match the installation settings.

Problem
Problem Description
The Content Master Studio fails to open with the following error: "Failed to Initialize CM engine! CM license is limited to 1 CPU, and is not compatible with this machine's hardware. Please contact support."

Cause
The Content Master license is licensed for fewer CPUs than the machine has. During registration, incorrect information was entered for the number of CPUs, so the license provided is for a machine with fewer CPUs.

Solution
To resolve the issue, perform the registration again, enter the correct number of CPUs, and send the new registration.txt to Informatica Support to obtain a new license. When the new license is received, replace the existing one in the Content Master installation directory.


Problem
Problem Description
When launching the Designer after installing the Unstructured Data Option (UDO), the following error is displayed: "Failed to load DLL: pmudtclient.dll for Plug-in: PC_UDT"

Cause
This error occurs when Content Master has not been installed along with the PowerCenter UDO.

Solution
To resolve this issue, install Content Master.

Last updated: 31-May-08 19:00


Deployment of B2B Data Transformation Services

Challenge


Outline the steps and strategies for deploying B2B Data Transformation services.

Description
Deployment is the process wherein a data transformation is made available as a service that is accessible to the B2B Data Transformation runtime engine. When a project is published as a transformation service, a directory corresponding to the published service name is created in the B2B Data Transformation ServiceDB, which forms a runtime repository of services. A CMW file corresponding to the service name is created in the same directory. The deployed service is stored in the Data Transformation service repository.

On Windows platforms, the default repository location is:
c:\Program Files\Informatica\ComplexDataExchange\ServiceDB

On UNIX platforms, the default location is:
/opt/Informatica/ComplexDataExchange/ServiceDB

Basics of B2B Data Transformation Service Deployment


When running in the B2B Data Transformation Studio environment, developers can test the service directly without deployment. However, in order to test integration with the host environment, platform agents or external code, it is necessary to deploy the service. Deploying the transformation service copies the service with its current settings to the B2B Data Transformation service repository (also known as the ServiceDB folder). Deploying a service also sets the entry point for the transformation service.

Note: The location of the service repository is set using the B2B Data Transformation configuration utility.

If changes are made to the project options or to the starting point of the service, it is necessary to redeploy the service in order for the changes to take effect. When the service is deployed, all service script files, schemas, sample data and other project artifacts will be deployed to the service repository as specified by the B2B Data Transformation configuration options in effect in the Studio environment from which the service is being deployed.

A transformation service can be deployed multiple times under different service names, with the same or different options for each deployed service. While Informatica recommends deploying only one service from each B2B Data Transformation project for production, it is useful to deploy a transformation service under different names when testing different option combinations.

Deployment for Test Purposes


It is important to finish configuring and testing data transformations before deploying them as B2B Data Transformation services. Deploying the service allows the B2B Data Transformation runtime engine to access and run the project. When running in the B2B Data Transformation Studio environment, developers can test the service directly without deployment. However, in order to test integration with the host environment, platform agents or external code, it is necessary to deploy the service.

Initial Production Deployment of B2B Data Transformation Services


Deploying services in the production environment allows applications to run the transformation services on live data. B2B Data Transformation services can be deployed from the B2B Data Transformation Studio computer to a remote computer such as a production server. The remote computer can be a Windows or UNIX-type platform where B2B Data Transformation Engine is installed. A service can be deployed to a remote computer either by a) deploying it directly to the remote computer or b) deploying the service locally and then copying it to the remote computer.

To deploy a service to a remote computer:

1. Deploy the service on the development computer.
2. Copy the deployed project directory from the B2B Data Transformation repository on the development computer to the repository on the remote computer.
3. If you have added any custom components or files to the B2B Data Transformation autoInclude\user directory, you must copy them to the autoInclude\user directory on the remote computer.

Alternatively, if the development computer can access the remote file system, you can change the B2B Data Transformation repository to the remote location and deploy directly to the remote computer.
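The copy step of remote deployment can be sketched as follows for a UNIX-type platform. The service name "MyParser" is hypothetical, and both repositories are simulated here with temporary directories; in practice the remote ServiceDB would be reached via a network mount or a tool such as scp. Touching update.txt signals the remote engine that the repository has changed.

```shell
# Sketch: copy a deployed service folder from the development repository to a
# (mounted) remote repository, then mark the repository as changed.
DEV_SERVICEDB=$(mktemp -d)      # e.g. /opt/Informatica/ComplexDataExchange/ServiceDB
REMOTE_SERVICEDB=$(mktemp -d)   # remote repository, reachable via a network mount

# Simulate a service previously deployed from Studio on the development machine.
mkdir -p "$DEV_SERVICEDB/MyParser"
echo "service definition" > "$DEV_SERVICEDB/MyParser/MyParser.cmw"

# Copy the whole service directory, preserving its contents.
cp -R "$DEV_SERVICEDB/MyParser" "$REMOTE_SERVICEDB/"

# Signal the remote engine that the repository changed.
touch "$REMOTE_SERVICEDB/update.txt"
```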

Deployment of Production Updates to B2B Data Transformation Services


B2B Data Transformation Studio cannot open a deployed project that is located in the repository. If you need to edit the data transformation, modify the original project and redeploy it.

To edit and redeploy a project:

1. Open the development copy of the project in B2B Data Transformation Studio. Edit and test it as required.
2. Redeploy the service to the same location, under the same service name. You are prompted to overwrite the previously deployed version.

Redeploying overwrites the complete service folder, including any output files or other files that you have stored in it. There is no versioning available in B2B Data Transformation. If previous versions of the deployed services are required, make a copy of the current service in a separate location (not in the ServiceDB directory) or utilize a commercial or open-source backup solution.

Renaming the service folder is also possible; the project name has to be renamed as well. This is not a recommended practice for backing up services or for deploying a service multiple times. It is preferable to use the Studio environment to deploy a service multiple times, as behaviors may change in future versions. For backup, there are many commercial and open-source backup solutions available; to quickly retain a copy of the service, copy it to a directory outside of the ServiceDB folder.

Important: There can be no more than one deployed service with the same service and project name. Project files contain configuration properties and indicate the transformation startup component. Having multiple services with identical project file names, even if the service names are different, will cause service execution to fail.
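Since redeployment overwrites the entire service folder and B2BDT keeps no versions, the quick-backup approach described above can be sketched as follows. The repository, the backup location and the service name "MyParser" are all simulated; the key point is that the backup lands outside the ServiceDB folder.

```shell
# Sketch: snapshot the current service folder to a location OUTSIDE ServiceDB
# before redeploying over it.
SERVICEDB=$(mktemp -d)          # stand-in for the real ServiceDB
BACKUP_ROOT=$(mktemp -d)        # must not be inside ServiceDB

# Simulate the currently deployed service.
mkdir -p "$SERVICEDB/MyParser"
echo "v1" > "$SERVICEDB/MyParser/MyParser.cmw"

# Timestamped copy of the service prior to redeployment.
STAMP=$(date +%Y%m%d%H%M%S)
cp -R "$SERVICEDB/MyParser" "$BACKUP_ROOT/MyParser.$STAMP"
```

To roll back, the saved folder would be copied back into ServiceDB under its original name and update.txt touched so the engine picks up the change.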

Simple Service Deployment


There are two ways to deploy a service: deploy it directly as a service from within B2B Data Transformation Studio, or deploy the service locally and copy the service folder to the appropriate ServiceDB directory.

Single Service Deployment from Within the B2B Data Transformation Studio Environment

1. In the B2B Data Transformation Explorer, select the project to be deployed.
2. On the B2B Data Transformation menu, click Project > Deploy.
3. The Deploy Service window displays the service details. Edit the information as required. Click the Deploy button.
4. Click OK.
5. At the lower right of the B2B Data Transformation Studio window, display the Repository view. The view lists the service that you have deployed, along with any other B2B Data Transformation services that have been deployed on the computer.

Single Service Deployment Via File Movement

Alternatively, the service folder can be copied directly into the ServiceDB folder.

To check whether the deployed service is valid, run CM_Console from the command line. Alternatively, cAPITest.exe can be used to test the deployed service.

The B2B Data Transformation Engine determines whether any services have been revised by periodically examining the timestamp of a file called update.txt. By default, the timestamp is examined every thirty seconds. The update.txt file exists in the repository root directory, which is, by default, the ServiceDB directory. The content of the file can be empty. If this is the first time a service is deployed to the remote repository, update.txt might not exist. If the file is missing, copy it from the local repository. If update.txt exists, update its timestamp as follows:

- On Windows: open update.txt in Notepad and save it.
- On UNIX: open a command prompt, change to the repository directory, and enter the following command: touch update.txt

You can change the interval used to check for service updates by modifying the Service refresh interval in the B2B Data Transformation configuration editor.
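The engine's refresh check amounts to comparing the current update.txt timestamp against the last one it observed. The idea can be illustrated with a small self-contained sketch (the repository path is an example; thirty seconds is the documented default interval, compressed here to one second for the demo):

```shell
#!/bin/sh
# Sketch: detect a service update by watching the mtime of update.txt,
# mimicking the engine's periodic timestamp check. Paths are examples.
set -e
REPO="/tmp/b2bdt_demo3/ServiceDB"
mkdir -p "$REPO"
touch "$REPO/update.txt"

# Remember the timestamp we last saw.
LAST_SEEN="/tmp/b2bdt_demo3/last_seen"
touch -r "$REPO/update.txt" "$LAST_SEEN"

# ... later, a deployment refreshes the file:
sleep 1
touch "$REPO/update.txt"

# The check: is update.txt newer than the timestamp we last saw?
if [ "$REPO/update.txt" -nt "$LAST_SEEN" ]; then
  echo "services changed - reload"
fi
```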


Multi-Service Deployment
When a solution involves the use of multiple services, these may be authored as multiple independent B2B Data Transformation projects, or as a single B2B Data Transformation project with multiple entry points to be deployed as multiple services under different names. For complex solutions, we recommend the use of multiple separate projects for independent services, reserving the use of multiple runnable components within the same project for test utilities and troubleshooting items.

While it is possible to deploy the set of services that make up a multi-service solution into production from the Studio environment, we recommend deploying these services to a test environment where the solution can be verified before deploying into production. In this way, mismatches between different versions of a solution's transformation services can be avoided. In particular, when dependencies occur between services due to the use of B2B Data Transformation features such as TransformByService, or due to interdependencies in the calling system, it is necessary to avoid deploying mismatching versions of transformation services and to deploy services into production as a group.

Simple batch files or shell scripts can be created to deploy the services as a group from the test environment to the production environment, and commercial enterprise system administration and deployment software will usually allow creation of a deployment package to facilitate scheduled unattended deployment and monitoring of deployment operations.

As a best practice, creating a dependency matrix for each project to be deployed allows developers to identify the services required by each project and those that are commonly accessed by the majority of the projects. This allows for better deployment strategies and helps to keep track of impacted services should changes be made to them.
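A group-deployment script of the kind described above can be as simple as the following sketch. The repository paths and service names are hypothetical; the point is that every service in the dependency group is copied before the engine is signaled, so production never runs a mismatched set.

```shell
#!/bin/sh
# Sketch: deploy a verified set of interdependent services as one group,
# then signal the engine once. Paths and names are examples; the
# scaffolding lines only make the sketch runnable as a demo.
set -e
TEST_REPO="/tmp/b2bdt_demo4/test/ServiceDB"
PROD_REPO="/tmp/b2bdt_demo4/prod/ServiceDB"
SERVICES="OrderParser OrderValidator OrderSerializer"   # the dependency group

# Demo scaffolding: pretend the group is verified in the test repository.
for s in $SERVICES; do
  mkdir -p "$TEST_REPO/$s"
  echo "$s" > "$TEST_REPO/$s/name.txt"
done
mkdir -p "$PROD_REPO"

# Copy every service in the group before signaling the engine.
for s in $SERVICES; do
  rm -rf "$PROD_REPO/$s"
  cp -R "$TEST_REPO/$s" "$PROD_REPO/$s"
done

# One refresh signal after the whole group is in place.
touch "$PROD_REPO/update.txt"
```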

Deploying for Full Uptime Systems


B2B Data Transformation has the ability to integrate into various applications, allowing it to become part of a full uptime system. An integration component, called the B2B Data Transformation Agent, runs a B2B Data Transformation service that performs the data transformation. The capabilities of integration systems are enhanced by supporting the conversion of many document formats that they do not natively support.

Deploying services for full uptime systems follows the same process as that of standalone B2B Data Transformation services. However, it is important to make sure that the user accounts used by the calling application have the necessary permissions to execute the B2B Data Transformation service and to write error logs to the configured location. After deploying the service, it may be necessary to stop and restart the workflow invoking the service. Make sure that the update.txt timestamp is updated; the B2B Data Transformation Engine determines whether any services have been revised by periodically examining the timestamp of update.txt. By default, the timestamp is examined every thirty seconds.

Multiple Server Deployment


For enhanced performance, you can install B2B Data Transformation on multiple Windows or UNIX servers. The following discussion assumes that you use a load-balancing module to connect to multiple, identically configured servers. The servers should share the same B2B Data Transformation services. There are two ways to implement a multiple server deployment:

- Shared file system: Store a single copy of the B2B Data Transformation repository on a shared disk. Configure all the servers to access the shared repository.
- Replicated file system: Configure each server with its own B2B Data Transformation repository. Use an automatic file deployment tool to mirror the B2B Data Transformation repository from a source location to the individual servers.

If the second approach is adopted, you must also replicate or touch the file update.txt, which exists in the repository directory. The timestamp of this file notifies the B2B Data Transformation Engine when the last service update was performed.
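For the replicated approach, the mirroring step should end by refreshing update.txt on every target so each engine reloads. A sketch using local directories as stand-ins for the per-server repositories (in practice a tool such as rsync or your deployment software would perform the copy):

```shell
#!/bin/sh
# Sketch: mirror a master repository to several server repositories,
# refreshing update.txt on each. All paths are hypothetical examples;
# the scaffolding lines only make the sketch runnable as a demo.
set -e
MASTER="/tmp/b2bdt_demo5/master/ServiceDB"
TARGETS="/tmp/b2bdt_demo5/server1/ServiceDB /tmp/b2bdt_demo5/server2/ServiceDB"

# Demo scaffolding: pretend the master repository holds one service.
mkdir -p "$MASTER/InvoiceParser"
echo "v2" > "$MASTER/InvoiceParser/version.txt"

for t in $TARGETS; do
  mkdir -p "$t"
  cp -R "$MASTER/." "$t/"      # stand-in for the mirroring tool
  touch "$t/update.txt"        # notify this server's engine
done
```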

Designing B2B Data Transformation Services for Deployment

Identifying Versions Currently Deployed

Whenever a service is deployed through B2B Data Transformation Studio, the user is prompted to set the options shown in the table below.

Option             Required/Description
Service Name       The name of the service. By default, this is the project name. To ensure cross-platform compatibility, the name must contain only English letters (A-Z, a-z), numerals (0-9), spaces, and the following symbols: %&+-=@_{} B2B Data Transformation creates a folder having the service name in the repository location.
Label              A version identifier. The default value is a time stamp indicating when the service was deployed.
Startup Component  The runnable component that the service should start.
Author             The person who developed the project.
Description        A description of the service.

Although version tracking is not available in the current version of B2B Data Transformation, deployment does take into account the service deployment timestamps. The deployment options are stored in a log file called deploy.log, which keeps a history of all deployment options set through B2B Data Transformation Studio. The option settings entered in the Deploy Service window are appended to the log file.

Deploying services to different servers through file copying or FTP will not update the deployment log file. It has to be updated manually if this information is required.
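A manual log update of this kind can itself be scripted alongside the file copy. The record layout below is an illustrative assumption only; match whatever format your existing deploy.log entries use, and note that the log path here is a placeholder.

```shell
#!/bin/sh
# Sketch: append a deployment record when a service is pushed by file
# copy or FTP rather than through Studio. The log path and the
# key=value record format are assumptions for illustration.
set -e
LOG="/tmp/b2bdt_demo6/deploy.log"
mkdir -p "$(dirname "$LOG")"

SERVICE="InvoiceParser"                 # hypothetical service name
LABEL=$(date "+%Y-%m-%d %H:%M:%S")      # mirrors the default timestamp label
printf 'service=%s label="%s" author=%s note="%s"\n' \
  "$SERVICE" "$LABEL" "jdoe" "deployed via FTP" >> "$LOG"
```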

Security and User Permissions


User permissions are required by users who install and use B2B Data Transformation Studio and Engine. Depending on the B2B Data Transformation application the organization runs, and the host environment used to invoke the services, additional permissions might be required. To configure data transformations in B2B Data Transformation Studio, users must have the following permissions:

- Read and write permission for the Eclipse workspace location
- Read and execute permission for the B2B Data Transformation installation directory and for all its subdirectories
- Read and write permission for the B2B Data Transformation repository, where the services are deployed
- Read and write permissions for the log application

For applications running B2B Data Transformation Engine, a user account with the following permissions is required:

- Read and execute permission for the B2B Data Transformation installation directory and for its subdirectories
- Read permission for the B2B Data Transformation repository
- Read and write permission for the B2B Data Transformation log path, or for any other location where B2B Data Transformation applications are configured to store error logs

Aside from user permissions, it is important to identify the user types that will be assigned to work with B2B Data Transformation. In a Windows setup, an administrator or limited user can be registered in the Windows Control Panel. Windows users who have administrative privileges can perform all B2B Data Transformation operations. Limited users, however, do not have write permissions for the B2B Data Transformation program directory and are not allowed to perform the following:

- Install or uninstall the B2B Data Transformation software
- Install a B2B Data Transformation license file
- Deploy services to the default B2B Data Transformation repository
- Add custom components such as document processors or transformers
- Change the setting values in the Configuration Editor
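On UNIX-type platforms, the engine account's permissions listed above can be granted along the following lines. This is a sketch only: the paths are hypothetical, and a real installation would typically use ownership and group assignments consistent with local security policy rather than blanket symbolic modes.

```shell
#!/bin/sh
# Sketch: set the permissions the engine account needs — read/execute
# on the install tree, read on the repository, read/write on the log
# path. Paths are examples; scaffolding makes the sketch runnable.
set -e
INSTALL_DIR="/tmp/b2bdt_demo7/install"
REPO="/tmp/b2bdt_demo7/ServiceDB"
LOGDIR="/tmp/b2bdt_demo7/logs"

# Demo scaffolding:
mkdir -p "$INSTALL_DIR" "$REPO" "$LOGDIR"

chmod -R a+rX "$INSTALL_DIR"    # read + directory-traverse for the install tree
chmod -R a+rX "$REPO"           # read access to deployed services
chmod -R a+rwX "$LOGDIR"        # read/write for error logs
```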

Backup Requirements
It is necessary to make regular backups of several B2B Data Transformation directories and files. In a production environment where B2B Data Transformation runs, it is important to back up three locations: the configuration file, the service repository, and the autoInclude\user directory. For a development environment, we recommend using a commercial or open-source source control system such as Subversion to manage backup and versioning of the B2B Data Transformation Studio workspaces of the developers in the organization. In addition, back up the same locations listed above for the production environment. If you use identical configurations on multiple servers, back up only a single copy of these items. In the event of a server failure, B2B Data Transformation can be re-installed in the same location as on the failed server and the backup restored.

Failure Handling
If a B2B Data Transformation service fails to execute successfully, it returns a failure status to the calling application. It is the responsibility of the calling application to handle the error. For example, the application can transmit failed input data to a failure queue. The application can package related inputs in a transaction to ensure that important data is not lost.
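The failure-queue pattern described here can be sketched as a thin wrapper around the service invocation. The wrapper below uses a deliberately failing command as a stand-in for however your host environment actually invokes the service (CM_Console, an agent, etc.), and the failure-queue directory is a hypothetical convention.

```shell
#!/bin/sh
# Sketch: run a transformation command; on failure, move the input to a
# failure queue directory instead of losing it. Paths are examples;
# scaffolding lines make the sketch runnable as a demo.
INPUT="/tmp/b2bdt_demo8/in/order1.txt"
FAILQ="/tmp/b2bdt_demo8/failure_queue"
mkdir -p "$(dirname "$INPUT")" "$FAILQ"
echo "bad data" > "$INPUT"

# Stand-in for the real service call; here it simply fails so the
# error path is exercised.
if ! sh -c 'exit 1'; then
  mv "$INPUT" "$FAILQ/"
  echo "input routed to failure queue" >&2
fi
```

In a real integration the move would typically be wrapped in the same transaction as the dequeue of the input, so that a crash between the call and the move cannot drop the message.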


In the event of a failure, the B2B Data Transformation Engine will generate an event log if event logging has been enabled for the project. To view the contents of the event file, drag the *.cme file into the events pane in B2B Data Transformation Studio. The method used to invoke a B2B Data Transformation service affects how, and whether, events are generated. The following table compares the effect of each invocation method on the generation of events:

API / invocation method   Event generation
CM_Console                A service deployed with events will produce events; a service deployed without events will not produce events.
Java API                  The service runs without events. In case of error, the service is rerun with events.
C# / .Net                 Same as Java.
Agents                    No events unless an error occurs, irrespective of how the service was deployed. The same behavior is used for PowerCenter.

While the events log provides a simple mechanism for error handling, it also has a high cost in resources such as memory and disk space for storing the event logs. For anything other than the simplest of projects, it is recommended to design an error handling mechanism into your transformations and calling logic to handle errors and provide the appropriate alerting when errors occur. In many production scenarios, the event log will need to be switched off for optimal performance and resource usage.

Updating Deployed Services


B2B Data Transformation Studio cannot directly update a deployed project in the transformation service repository. To perform updates on the data transformation, the modifications must be made to the original transformation project, and the project then needs to be redeployed.

Note: A different project can be deployed under the existing service name, so technically it does not have to be exactly the original project.

If it is required to track all deployed versions of the data transformation, make a copy of the current service in a separate location or, alternatively, consider the use of a source control system such as Subversion. Redeploying overwrites the complete service folder, including any output files or other files that you have stored in it.

It is important to test the deployed service following any modifications. While the Studio environment will catch some errors and block deployment if the transformation is invalid, some types of runtime errors cannot be caught by the Studio environment prior to deployment.

Upgrading B2B Data Transformation Software (Studio and Runtime Environment)


When upgrading from a previous B2B Data Transformation release, existing projects and deployed services can also be upgraded to the current release. The upgrade of projects from B2B Data Transformation version 3.1 or higher is automatic. Individual projects can be opened or imported in B2B Data Transformation Studio, with the developer prompted to upgrade the project if necessary. Test the project and confirm that it runs correctly once the upgrade is completed, then deploy the service to the production environment.

Another way to upgrade services is through the syntax conversion tool that comes with B2B Data Transformation. It allows the upgrade of multiple projects and services quickly, in an automated operation. It is also used to upgrade global TGP script files, which are stored in the B2B Data Transformation autoInclude\user directory. The syntax conversion tool supports the upgrade of projects or services from version 3.1 and higher on Windows, and from release 4 on UNIX-type platforms.

Before the upgrade, the tool creates an automatic backup of your existing projects and files. It creates a log file and reports any upgrade errors that it detects. In case of an error, restore the backup, correct the problem, and run the tool again.

It is necessary to organize the projects before running the tool. The tool operates on projects or services that are stored in a single parent directory. It can operate on:

- A B2B Data Transformation Studio version 4 workspace
- A B2B Data Transformation repository
- Any other directory that contains B2B Data Transformation Studio projects or services

Within the parent directory, the projects must be at the top level of nesting. If the projects are not currently stored in a single parent directory, re-organize them before running the tool. Alternatively, the tool can be run separately on the individual parent directories.

To run the syntax conversion tool in Windows, go to the B2B Data Transformation folder from the Start menu, then click Syntax Conversion Tool. The tool is a window with several tabs, where the upgrade settings can be configured.


After the service upgrade is complete, change the repository location to the new location using the Configuration Editor. Test the projects and services to confirm that they work correctly and that their behavior has not changed.

On UNIX platforms, run the command <INSTALL_DIR>/bin/CM_DBConverter.sh. Only 4.x is supported.

Optionally, you can run the syntax conversion tool from the command line, without displaying the graphical user interface. In an open console, change to the B2B Data Transformation bin directory and run one of the following commands:

- On Windows: CM_DBConverter.bat <switches>
- On UNIX: CM_DBConverter.sh console <switches>

Following each switch, leave a space and type the value. If a path contains spaces, you must enclose it in quotation marks. The <switches> are listed in the following table.

Switch  Required/Optional  Description

-v      Required           Version from which you are upgrading (3 or 4). On UNIX, only 4 is supported.
-s      Required           Path of the source directory, containing projects or services.
-d      Optional           Path of the target directory. If you omit this switch, the tool overwrites the existing directory.
-si     Optional           Path of the source autoInclude\user directory. If you omit this switch, the tool does not upgrade global TGP files.
-di     Optional           Path of the target autoInclude\user directory. If you omit this switch, the tool overwrites the existing directory.
-l      Optional           Path of the upgrade log file. The default is <INSTALL_DIR>\SyntaxConversionLog.txt.
-b      Optional           Path of the backup directory, where the tool backs up the original projects or services prior to the upgrade. The default is the value of the -s switch concatenated with the suffix _OLD_Backup.
-e      Optional           Path of the error directory, where the tool stores any projects or services that it cannot upgrade due to an error. The default is the value of the -s switch concatenated with the suffix _OLD_Failure.
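A typical UNIX invocation assembled from the documented switches might look like the following. All paths are hypothetical, and the command is printed rather than executed here, since the converter ships with the product installation.

```shell
#!/bin/sh
# Sketch: compose a CM_DBConverter console command from the documented
# switches (-v, -s, -d, -l). Paths are placeholders for illustration;
# the command is echoed, not run.
SRC="/opt/b2bdt/repositories/old_services"
DST="/opt/b2bdt/repositories/upgraded_services"
LOG="/var/log/b2bdt/SyntaxConversionLog.txt"

CMD="./CM_DBConverter.sh console -v 4 -s \"$SRC\" -d \"$DST\" -l \"$LOG\""
echo "$CMD"
```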

Last updated: 29-May-08 16:47


Establishing a B2B Data Transformation Development Architecture

Challenge


Establish a development architecture that ensures support for team development of B2B Data Transformation solutions; establishes strategies for common development tasks such as error handling and the styles of B2B Data Transformation service authoring; and plans for the subsequent clean migration of solutions between development, test, quality assurance (QA) and production environments that can scale to handle additional users and applications as the business and development needs evolve.

Description
In this Best Practice, the term development architecture means establishing a development environment and establishing strategies for error handling, version control, naming conventions, mechanisms for integration with the host environment, and other aspects of developing B2B Data Transformation services that are not specific to a particular solution.

Planning for the migration of the completed solution is closely related to the development architecture. This can include the transfer of finished and work-in-progress solutions between different members of the same team, between different teams such as development, QA, and production teams, and between development, test and production environments.

Deciding how to structure the development environment for one or more projects depends upon several factors. These include technical factors, such as choices for hosting software and host environments, and organizational factors regarding the project team makeup and interaction with operations, support and external test organizations.

Technical factors:

- What host environment is used to invoke the B2B Data Transformation services?
- What are the OS platform(s) for development, test and production?
- What software versions are being used for both B2B Data Transformation and for host environment software?
- How much memory is available on development, test and production platforms?
- Are there previous versions of the B2B Data Transformation software in use?
- The use of shared technical artifacts, such as XML schemas shared between projects, services, applications and developers.
- What environments are expected to be used for development, test and production? (Typically development is performed on Windows; test and production may be AIX, Solaris, Linux, etc.)

Organizational factors:

- How do development, test, production and operations teams interact?
- Do individual developers work on more than one application at a time?
- Are the developers focused on a single project, application or project component?
- How are transformations in progress shared between developers?
- What source code control system, if any, is used by the developers?
- Are development machines shared between developers, either through sequential use of a physical machine, through the use of virtual machines, or through use of technologies such as Remote Desktop Access?
- How are different versions of a solution, application or project managed?
- What is the current stage of the project life cycle? For example, has the service being modified already been deployed to production?
- Do developers maintain or create B2B Data Transformation services for multiple versions of the B2B Data Transformation products?

Each of these factors plays a role in determining the most appropriate development environment for a B2B Data Transformation project. In some cases, it may be necessary to create different approaches for different development groups according to their needs. B2B Data Transformation, together with the B2BDT Studio environment, offers flexible development configuration options that can be adapted to fit the needs of each project or application development team. This Best Practice is intended to help the development team decide which techniques are most appropriate for the project. The following sections discuss various options that are available, based on the environment and architecture selected.

Terminology
B2B Data Transformation (abbreviated as B2BDT) is used as a generic term for the parsing, transformation and serialization technologies provided in Informatica's B2B Data Exchange products. These technologies have been made available through the Unstructured Data Option for PowerCenter, and as standalone products such as B2B Data Transformation and its earlier versions, known respectively as PowerExchange for Complex Data and, formerly, ItemField ContentMaster.

The B2B Data Transformation development environment uses the concepts of workspaces, projects and services to organize its transformation services. The overall business solution or solutions may impose additional structure requirements, such as organizing B2BDT services into logical divisions (solutions, applications, projects and business services) corresponding to the needs of the business. There may be multiple B2BDT services corresponding to these logical solution elements. We will use the term B2BDT service to refer to a single B2BDT transformation service, and B2BDT project to refer to the B2B Data Transformation project construct as exposed within the B2BDT Studio environment.

Throughout this document we use the term developers to refer to team members who create B2BDT services, irrespective of their actual roles in the organization. Actual roles may include business analysts, technical staff in project or application development teams, members of test and QA organizations, or members of IT support and helpdesk operations who create new B2BDT transformations or maintain existing ones.

Fundamental Aspects of B2BDT Transformation Development


There are a number of fundamental concepts and aspects to development of B2BDT transformations that affect design of the development architecture and distinguish B2BDT development architecture from other development architectures.

B2BDT is an Embedded Platform


When B2BDT transformations are placed into production, the runtime is typically used in conjunction with other enterprise application or middleware platforms. The B2BDT runtime is typically invoked from other platform software (such as PowerCenter, BizTalk, webMethods or other EAI or application server software) through the use of integration platform adapters, custom code or some other means. While it is also possible to invoke B2BDT services from a command line utility (CM_Console) without requiring the use of additional platform software, this is mainly provided for quick testing and troubleshooting purposes. CM_Console does not provide access to all available system memory or scale across multiple CPUs. Specifically, restrictions on the CM_Console application include always running the B2BDT transformation engine in-process and use of the local directory for event output.

B2BDT does support using the runtime as a server process to be invoked from other software on the same physical machine (in addition to offering the ability to invoke the B2BDT runtime engine in-process with the calling environment). While this does constitute a server process in developer terminology, it does not provide the full server administration or monitoring capabilities that are typical of enterprise server products. It is the responsibility of the calling environment to provide these features, and part of the overall solution architecture is to define how these facilities are mapped to a specific B2BDT implementation.

[Figure: B2BDT Deployment with PowerCenter]

While the B2BDT runtime is usually deployed on the same machine as the host EAI environment, it is possible to locate the B2BDT services (stored in a file-based repository) on the same machine or a remote machine. It is also possible to deploy B2BDT services to be exposed as a set of web services; in this case the hosting web/application server forms the server platform that provides these server software services. The web service platform in turn will invoke the B2BDT runtime either in-process with the web service stack or as a separate server process on the same machine.

Note: Modern application servers often support mechanisms for process, application and thread pooling, which blurs the distinctions between the effects of in-process vs. server invocation modes. In-process invocation can be thought of as running the B2BDT transformation engine as a shared library within the calling process.

[Figure: B2BDT Deployed as Web Services Used With Web Application]


Sample Data for Parse by Example and Visual Feedback


During the process of authoring a B2BDT transformation, sample data may be used to perform the actual authoring through drag-and-drop and other UI metaphors. Sample data is also used to provide visual confirmation, at transformation design time, of which elements in the data are being recognized, mapped and omitted. Without sample data, there is no way to verify the correctness of a transformation or get feedback on the progress of a transformation during authoring. For these reasons, establishing a set of sample data for use during authoring is an important part of planning for the development of B2BDT transformations.

Sample data to be used for authoring purposes should be representative of the actual data used during production transformations, but sized to avoid excessive memory requirements on the Studio environment. While the Studio environment does not impose specific limits on data to be processed, the cumulative effects of using document preprocessors within the Studio environment in conjunction with use of B2BDT event reporting can impose excessive memory requirements.

Eclipse-Based Service Authoring Environment


The B2BDT authoring environment, B2BDT Studio, is based on the widely supported Eclipse platform. This has two implications:

1. Many switches, configuration options, techniques and methods of operation that affect the Eclipse environment are also available in B2BDT Studio. These include settings for memory usage, the version of the JVM used by the Studio environment, etc.
2. Eclipse plug-ins that support additional behaviors and/or integration of other applications, such as source code control software, can be used with the B2BDT Studio environment. While the additional features offered by these plug-ins may not be available in the B2BDT authoring perspective, by switching perspectives B2BDT developers can often take advantage of the features and extensions provided by these plug-ins.

Note: An Eclipse perspective is a task-oriented arrangement of views, menu options, commands, etc. For example, while using the B2BDT authoring perspective, features for creation of Java programs or source control will not be visible, but they may be accessed by changing perspectives. Some features of other perspectives may be incompatible with use of the B2BDT authoring perspective.

Service Authoring Environment Only Supported on Windows OS Variants


While B2BDT services may be deployed and placed into production on many environments, such as a variety of Linux implementations, AIX, Solaris and Windows Server OS variants, the B2BDT Studio environment used to author B2BDT services only runs on Windows OS variants such as Windows 2000 and Windows XP. There are a number of features in B2BDT that may only run on Windows, and some custom components, such as custom COM-based actions or transformations, are Windows-specific as well. This means it is possible to create a transformation within the Studio environment that will only run on the development environment and may not be deployed into production on a non-Windows platform.

File System Based Repository for Authoring and Development


B2BDT uses a file system based repository for runtime deployment of B2BDT services, and a similar file-based workspace model for the physical layout of services. This means that mechanisms for sharing source artifacts (such as schemas and test data), projects and scripts, and deployed solutions must be created using processes and tools external to B2BDT. These might include source control systems for sharing transformation sources, third-party application deployment software, and processes implemented either manually or through scripting environments for management of shared artifacts, deployment of solutions, etc.

Support for Learn-by-Example Authoring Techniques


Authoring of a B2BDT solution may optionally use supplied sample data to determine how to extract or parse data from a representative source data construct. Under this mechanism, a transformation developer may elect to let the B2BDT runtime system decide how to extract or parse data from a sample data input. When this mechanism is used, the sample data itself becomes a source artifact for the transformation, and changes to the sample data can affect how the system determines the extraction of the appropriate data.

[Figure: Use of Learn by Example in B2BDT]


When using learn-by-example transformations, the source data used as an example must be deployed with the B2BDT project as part of the production B2BDT service. In many cases it is recommended to use the learn-by-example mechanism as a starting point only, and to use explicit (non-learn-by-example) transformation mechanisms for systems requiring a high degree of fine control over the transformation process. If learn-by-example mechanisms are employed, changes to the sample data should be treated as requiring the same degree of test verification as changes to the transformation scripts.

Support for Specification Driven Transformation Authoring Techniques


As B2BDT transformations are also represented as a series of text files, it is possible to parse a specification (in a Microsoft Word, Microsoft Excel, Adobe PDF or other format document) to determine how to generate a transformation. Under this style of development, the transformation developer parses one or more specifications, rather than the actual source data, and generates one or more B2BDT transformations as output. This can be used instead of, or in addition to, standard transformation authoring techniques. Many of the Informatica-supplied B2BDT libraries are built in this fashion.

Note: Typically at least one transformation is created manually in order to get an approximation of the desired target transformation.


In these cases, specifications should be treated as source artifacts and changes to specifications should be verified and tested (in conjunction with the spec driven transformation services) in the same manner as changes to the transformations.

B2B Data Transformation Project Structure


The B2B Data Transformation Studio environment provides the user interface for the development of B2B Data Transformation services. It is based on the open source Eclipse environment and inherits many of its characteristics regarding project organization. From a solution designer's viewpoint, B2B Data Transformation solutions are organized as one or more B2B Data Transformation projects in the B2BDT Studio workspace.

Studio Environment Indicating Main Views


The B2BDT workspace defines the overall set of transformation projects available to a developer working in a single Studio session. Developers may have multiple workspaces, but only one workspace is active within the Studio environment at any one time. All artifacts, such as scripts, project files and other project elements, are stored in the file system as text files and can be versioned using traditional version control systems.

Each B2BDT project can be used to publish one or more B2BDT services. Typically a single project is used to publish a single primary service, although it may be desirable to publish debug or troubleshooting variants of a project under different service names.

Note: The same B2BDT project can be published multiple times specifying different entry point or configuration parameters.

The syntax displayed in the Studio environment differs from the text representation of the script files (such as TGP files) that make up the B2B Data Transformation project. This is discussed further when reviewing considerations for multi-person team development.

From a physical disk storage viewpoint, the workspace is a designated file system location where B2BDT Studio stores a set of B2BDT projects. By default, there is a single B2B Data Transformation workspace, located in the directory My Documents\Informatica\ComplexDataExchange\4.0\workspace. All projects in the current B2B Data Transformation Studio workspace are displayed in the Explorer view.


Note: It is possible to have other workspaces for Java projects, etc. These are not visible in the Complex Data Authoring perspective in B2B Data Transformation Studio. Optionally, it is possible to create more than one workspace. For example, a solution designer might have multiple workspaces for different sets of B2B Data Transformation projects.

TIP: For B2BDT Studio 4.3 and earlier releases, use the B2BDT Studio\Eclipse\Workspace setting in the B2BDT configuration editor to change the workspace. In B2BDT 4.4, you can change the workspace by using the File | Switch Workspace menu option.

Each B2B Data Transformation project holds the business rules and operations for one or more transformation services. Once completed, or while under development, the project may be published to the B2B Data Transformation repository to produce a deployable transformation service. During publication, an entry point to the service is identified, and a named transformation service is produced that specifies a particular transformation project along with a well-known entry point where initial execution of the transformation service will take place. It is possible to publish the same project multiple times with different names, identifying a different entry point on each deployment, or even to publish the same project multiple times with the same entry point under different names.

B2B Data Transformation services are published to the runtime repository of services. In B2B Data Transformation, this takes the form of a file system directory (typically c:\program files\Informatica\ComplexDataExchange\ServiceDB) known as the service DB. This may be located on a local or network-accessible file system.

Once the development version of a transformation service has been published, it may be copied from the service DB location by copying the corresponding named directory. This service directory can then be deployed by copying it to the service DB directory on a production machine.
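Because both the development and production service DBs are ordinary directory trees, this copy step can be scripted with external tooling. The sketch below is an illustration only (the function name and paths are not part of B2BDT); it replaces any prior deployment of the same service name:

```python
import shutil
from pathlib import Path

def deploy_service(service_name, source_db, target_db):
    """Copy one published service directory from a development
    service DB to a production service DB, overwriting any
    previous deployment of the same name. Paths are illustrative."""
    src = Path(source_db) / service_name
    dst = Path(target_db) / service_name
    if not src.is_dir():
        raise FileNotFoundError(f"service not published: {src}")
    if dst.exists():
        shutil.rmtree(dst)          # remove the prior deployment
    shutil.copytree(src, dst)       # copy the whole service directory
    return dst
```

In practice, target_db would be the ServiceDB path on the production machine, possibly a network file share or SAN location.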

File System View of Workspace


The workspace is organized as a set of sub-directories, with one sub-directory representing each project. A specially designated directory named .metadata holds metadata about the current workspace. Each sub-directory is named with the project name for that project.

Workspace Layout


Behind the scenes, B2B Data Transformation by default creates a new project in a directory corresponding to the project name, rooted in the Eclipse workspace. (In B2BDT 4.3 this can be overridden at project creation time to create projects outside of the workspace, while in B2BDT 4.4 the Studio environment determines whether it needs to copy a project into the workspace; if the path specified for an imported project is already within a workspace, B2BDT simply adds the project to the list of available projects in the workspace.) A .cmw file with the same primary project name is also created within the project directory; the .cmw file defines which schemas, scripts and other artifacts make up the project.

When a project is published to a specific transformation service, a directory is created in the B2B Data Transformation service DB (the runtime repository of services) with a name corresponding to the published service name. A CMW file corresponding to the service name is created in the same directory. Creating a new project in the Studio environment causes changes to the workspace metadata directory so that the project is discoverable in the B2BDT Studio environment.

File System View of Service DB


The service database is organized as a set of sub-directories under the service database root, with one sub-directory representing each deployed service. When a service is deployed, it is copied along with the settings in effect at the time of deployment. Subsequent changes to the source project do not affect deployed services unless the project is redeployed under the same service name. It is possible to deploy the same B2BDT project under multiple different service names.

TIP: If a project contains a sample data file with the extension .cmw, it can cause the B2BDT runtime to detect an error with that deployed service. This can prevent all services from being detected by the runtime. If it is necessary to have a sample data file with the extension .cmw, use a different extension for the sample data and adjust scripts accordingly. This scenario commonly occurs with specification-driven transformations.
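The .cmw tip above lends itself to an automated pre-deployment check. The following is a hypothetical audit script, not a B2BDT utility; it assumes that a deployed service directory should contain exactly one .cmw file (the service definition), so any extra .cmw file is usually mis-named sample data:

```python
from pathlib import Path

def find_suspect_services(service_db):
    """Return a dict of service directories that hold more than one
    .cmw file, mapping service name to the list of .cmw file names.
    service_db is assumed to be a ServiceDB-style directory tree."""
    suspects = {}
    for service_dir in Path(service_db).iterdir():
        if service_dir.is_dir():
            cmw_files = sorted(p.name for p in service_dir.rglob("*.cmw"))
            if len(cmw_files) > 1:
                suspects[service_dir.name] = cmw_files
    return suspects
```

Running such a check before copying services to a production service DB can catch the failure mode described in the tip before it breaks service detection.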


Solution Organization
The organizational structure of B2B Data Transformation solutions is summarized below.

Element                              Parent
Service Repository                   None. This is the top-level organization structure for published Complex Data services.
Published Complex Data Service       Repository. There may be multiple projects in a repository.
Project                              Workspace. There may be multiple projects in a studio workspace.
TGP Script                           Project.
XML Schema                           Project.
Parser, Mapper, Serializer           TGP Script. However, naming is global to a project and not qualified by the TGP script name.
Global Variables, Actions, Markers   TGP Script. However, naming is global to a project and not qualified by the TGP script name.

Planning for B2BDT Development


While the overall solution life cycle may encompass project management, analysis, architecture, design, implementation, test, deployment and operation, following a methodology such as Informatica's Velocity methodology, from a development architecture perspective we are mainly concerned with facilitating the actual implementation of transformations and the subsequent test and deployment of those transformations.

Pre-Implementation Development Environment Requirements


During the analysis phase we are mainly concerned with identifying the business needs for data transformations, the related data characteristics (data format, size, volume, frequency, performance constraints) and any existing constraints on target host environments (if candidate target host environments have already been identified). Because B2BDT transformations make use of sample data during the authoring process, we also need to plan for obtaining data and schema samples as part of the requirements gathering and architecture phases. Other considerations include identifying any security constraints on the use or storage of data, identifying any need to split data, and sizing and scaling the eventual system, which depends to a large extent on the volume of data, performance constraints, responsiveness targets, etc. For example, HIPAA specifications include privacy restrictions and constraints in addition to defining message and transaction formats.

Pre-Development Checklist
While many of these requirements address the solution or solutions as a whole, rather than the development environment specifically, there are a number of criteria that have a direct impact on the development architecture:
- What sample data will be used in creation of the B2BDT services? The size of the sample data used to create the B2BDT services will determine some of the memory requirements for development environments.
- Will specific preprocessors be required for B2BDT transformation authoring? Some preprocessors, such as the Excel or Word preprocessors, require additional software such as Microsoft Office to be deployed to the development environments. In some cases, custom preprocessors and/or transformers may need to be created to facilitate authoring of transformation solutions.
- Are there specific libraries being used, such as the B2BDT Accord, EDI or HIPAA libraries? Use of specific libraries will have an impact on how transformations are created and on the specific licenses required for development usage of the Complex Data Transformation tools.
- Are custom components being created, such as custom actions, transformers or preprocessors, that will be shared among developers of B2BDT transformations? In many cases, these custom components will need to be deployed to each B2BDT Studio environment, and a process needs to be defined for handling updates and distribution of these components.
- Are there any privacy or security concerns? Will data need to be encrypted or decrypted? Will cleansed data be needed for use with learn-by-example based transformations?
- How will the B2BDT runtime be invoked? Via a platform adapter, custom code, command line, web services, HTTP, etc.? Each of these communication mechanisms may impose specific development requirements with regard to testing of work in progress, licensing of additional B2BDT components, performance implications and design choices.
- Will data splitting be needed? Depending on the choice of 32-bit vs. 64-bit B2B Data Transformation runtimes, and on both the host software platform and the underlying OS and hardware platform, data may need to be split through the use of B2BDT streaming capabilities, custom transformations or preprocessors.
- How are B2BDT transformations created? What artifacts affect their creation? What is the impact of changes to specifications, schemas, sample data, etc.? In some cases, such as spec-driven transformation, changes to specifications go beyond design change requests and may require actual rerunning of the transformations that produce other executable artifacts, documentation, test scripts, etc.

Establishing the Development Environment


B2B Data Transformation services are defined and designed in the B2B Data Transformation Studio environment. The B2B Data Transformation Studio application is typically installed on each developer's local machine and allows the visual definition of transformations, the usage of libraries and the use of import processes to build one or more B2B Data Transformation services. All extensions used during authoring, such as custom transformations, preprocessors and actions, must be installed in each B2BDT Studio installation. Preprocessors are provided with the Studio environment to support manipulation of file types such as Excel, Word and PDF files; for some formats it may be necessary to create custom preprocessors to optimize usage of source data within the B2BDT Studio environment.

Note: In some cases, additional optional Studio features may need to be licensed in order to access necessary preprocessors and/or libraries.

During transformation authoring, B2BDT services are organized as a set of B2BDT projects within a B2BDT workspace. Each B2BDT project consists of a set of transformation scripts, XML schema definitions, and sample data used in authoring and/or at runtime of the transformation. B2BDT projects and workspaces use file system based artifacts for all aspects of the definition of a B2BDT project. Because of this, traditional source code control systems may be used to share work in progress.

Development Environment Checklist


Many of the implementation issues will be specific to the particular solution. However, there are a number of common issues for most B2BDT development projects:

- What is the host environment, and what tools are required to develop and test against that environment? While B2BDT Studio is a Windows-only environment, additional consideration may need to be given to the ultimate host environment regarding what tools and procedures are required to deploy the overall solution and troubleshoot it on the host environment.
- What is the communication mechanism with the host environment? How does the host environment invoke B2BDT transformations? Is it required for work-in-progress testing, or can the invocation method be simulated through the use of command line tools, scripts or other means?
- What are the security needs during development, deployment and test? How will they affect the development architecture?
- What are the memory and resource constraints for the development, test and production environments?
- What other platform tools are needed during development?
- What naming conventions should be used?
- How will work be shared between developers? How will different versions of transformations be handled?
- Where or how are intermediate XML schemas defined and disseminated? Are they specific to individual services? Shared between services? Externally defined, either by other project teams or by external standards bodies?
- What is the folder and workspace layout for B2BDT projects?

Supporting Multiple Users


The B2BDT Studio environment is intended to be installed on each developer's own machine or environment. While advances in virtualization technologies and technologies such as Windows Remote Desktop Connection theoretically allow multiple users to share the same B2BDT installation, the B2BDT Studio environment does not implement mechanisms, such as file locking during authoring of transformations, that are needed to prevent multiple users from overwriting each other's work.

TIP: For PowerCenter users, it is important to note that B2BDT does not implement a server-based repository environment for work in progress, and other mechanisms are needed to support sharing of work in progress. The service database may be shared between different production instances of B2BDT by locating it on a shared file system mechanism such as a network file share or SAN. The B2BDT development environment should be installed on each B2BDT transformation author's private machine.

The B2BDT Studio environment does support multiple users using the same development environment. However, each user should be assigned a separate workspace. As the workspace, along with many other default B2BDT configuration parameters, is stored in the configuration file, the environment needs to be configured to support multiple configuration files, with one assigned to each user.


Creating Multiple Configurations

To create multiple configurations, you can edit and copy the default configuration file.

1. Make a backup copy of the default configuration file, <INSTALL_DIR>/CMConfig.xml. At the end of the procedure, you must restore the backup to the original CMConfig.xml location.
2. Use the Configuration Editor to edit the original copy of CMConfig.xml. Save your changes.
3. Copy the edited CMConfig.xml to another location or another filename.
4. Repeat steps 2 and 3, creating additional versions of the configuration file. In this way, you can define as many configurations as you need.
5. Restore the backup that you created in step 1. This ensures that the default configuration remains as before.

Selecting the Configuration at Runtime

You can set the configuration file that B2B Data Transformation Engine should use in any of the following ways:

1. Define an environment variable called IFConfigLocation4. The value of the variable must be the path of a valid configuration file, for example: c:\MyIFConfigLocation4\CMConfig1.xml
2. On Unix only: store the configuration file under the name CMConfig.xml in the user's home directory.
3. Use the default configuration file, <INSTALL_DIR>/CMConfig.xml.

When B2B Data Transformation Engine starts, it searches these locations in sequence. It uses the first configuration file that it finds.

Example 1: Suppose you want to run two applications, which run B2B Data Transformation Engine with different configuration files. Each application should set the value of IFConfigLocation4 before it starts B2B Data Transformation Engine.

Example 2: Two users want to run B2B Data Transformation Engine with different configurations on the same Unix-type platform. Store their respective configuration files in their home directories; both files must have the name CMConfig.xml. Alternatively, store a CMConfig.xml file in the home directory of one of the users; the other user uses the default configuration file, <INSTALL_DIR>/CMConfig.xml.

Multiple JREs

On Windows platforms, the JVM Location parameter of the configuration file defines the JRE that B2B Data Transformation should use. By using multiple configuration files, you can switch JREs. On Unix-type systems, the configuration file does not contain a JVM Location parameter; to switch JREs, you must load a different environment-variable file.

Running Multiple Configurations Concurrently

B2B Data Transformation Engine loads the configuration file and the environment variables when it starts. After it starts, changing the configuration file or the environment variables has no effect. This means that two applications can use different configurations concurrently. Each application uses the configuration that was in effect when its instance of B2B Data Transformation Engine started.
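A wrapper script can apply the first lookup rule on a per-invocation basis. In this sketch, the engine command and configuration path are placeholders (substitute your actual engine invocation); only the use of the IFConfigLocation4 environment variable comes from the behavior described above:

```python
import os
import subprocess

def run_engine_with_config(engine_cmd, config_path, extra_args=()):
    """Launch a command (assumed to start B2B Data Transformation
    Engine) with IFConfigLocation4 pointing at a specific
    configuration file, without changing the parent environment."""
    env = os.environ.copy()
    env["IFConfigLocation4"] = config_path  # engine reads this at startup
    return subprocess.run([engine_cmd, *extra_args], env=env)
```

Because the engine reads the variable only at startup, two such wrappers can run concurrently with different configuration files, matching the concurrency behavior described above.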

While this can theoretically allow Windows-based sharing mechanisms such as Remote Desktop Connection to share the same installation of B2BDT, it is important to specify a different workspace for each user due to the possibility of files being overwritten by different users.

As a best practice, it is recommended that each user of B2BDT Studio is provided with a separate install of the B2BDT Studio environment on a dedicated machine. Sharing of work in progress should be accomplished through the use of a source control system, rather than multiple users using the same workspace simultaneously. In this manner, each transformation author's environment is kept separate while still allowing multiple authors to create transformations and share them between environments.

Using Source Code Control for Development


As B2BDT transformations are all defined as text-based artifacts (scripts, XML schema definitions, project files, etc.), B2BDT transformation authoring lends itself to good integration with traditional source code control systems. There are a number of suitable source code control systems on the market, and open-source environments such as CVSNT and Subversion both have Eclipse plug-ins available that simplify the process.

While source code control is a good mechanism for sharing work between multiple transformation authors, it also serves as a good mechanism for reverting to previous versions of a code base, keeping track of milestones and other change control aspects of a project. Hence it should be considered for all but the most trivial of B2B Data Transformation projects, irrespective of the number of transformation authors.

What should be placed under source code control? All project files that make up a transformation should be checked in when a transformation project is checked in. These include sample data files, TGP script files, B2BDT project files (ending with the extension .cmw), and XML schema definition files (ending with the extension .xsd).

During test execution of transformations, the B2BDT engine and Studio environment generate a results subdirectory in the project source directory. The files in this directory include temporary files generated during project execution under the Studio environment (typically output.xml) and the Studio events file (ending in .cme). These should not be checked in, and should be treated as temporary files. When a service is deployed, a deploy.log file is generated in the project directory. While it may seem desirable to keep track of the deployment information, a different deployment log file will be generated on each user's machine, so deploy.log should not be checked in either.

What are the effects of different authoring changes?
The following table describes the file system changes that occur when different actions are taken:

Action                                            Change
Creating a new B2BDT project                      New B2BDT project directory created in workspace
Importing an XML schema                           Schema and dependencies copied to B2BDT project directory
Adding a new script                               New TGP file in B2BDT project directory; modifications to CMW file
Adding a new test data file                       Files copied to B2BDT project directory
Running a transformation within the Studio        Changes to the results directory; new entries in the events .cme file; changes to the CMW file
Modifications to the project preferences          Modifications to the B2BDT project file
Modifications to the Studio preferences           Modifications to the metadata directory in workspace

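If a version control system that supports ignore patterns is used, the generated files described above can be excluded from check-in with an ignore list. The patterns below are Git-style and purely illustrative; CVSNT and Subversion instead use .cvsignore files or the svn:ignore property to express the same rules:

```text
# Generated during test execution in B2BDT Studio - do not check in
results/
*.cme
# Generated on each user's machine at deployment time
deploy.log
```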

Special Considerations for Spec Driven Transformation


Spec-driven transformation is the use of a B2BDT transformation to generate a different B2BDT transformation, based on one or more inputs that form the specification for the transformation. As a B2BDT transformation is itself a set of text files, it is possible to automate the generation of B2BDT transformations, with or without subsequent user modification. Specifications may include Excel files that define mappings between source and target data formats, PDF files generated by a standards body, or a variety of custom specification formats.

As the specification itself becomes part of what determines the generated transformation scripts, specifications should be placed under source code control. In some cases it is sufficient to check in only the specifications and regenerate the transformations as needed. In other cases, the time taken to generate the transformations may be too great to regenerate them on every team merge event, or it may be necessary to preserve the generated transformations for compliance with auditing procedures. In those cases, it is necessary to place the generated transformations under source code control as well.

Sharing B2BDT Services


In addition to defining production transformations, B2BDT supports the creation of shared components, such as shared library transformations that may be shared using the "autoinclude" mechanism. B2BDT also supports the creation of custom transformers, preprocessors and other components that may be shared across users and B2BDT projects. These should all be placed under source code control and must also be deployed to production B2BDT environments if used by production code.

Note: For PowerCenter users, these can be thought of as the B2B Data Transformation equivalent of Mapplets and Worklets, and offer many of the same advantages.

Sharing Metadata Between B2BDT Projects


Many B2BDT solutions are comprised of multiple transformation projects. Often there is shared metadata such as XML schema definitions, and other shared artifacts.


When an XML schema is added to a project, local copies of the schema, along with any included schemas, are placed in the project directory. If one or more schemas are used in multiple projects, they must be copied to each project when a change occurs to the schema. One recommendation for sharing schemas is to place the schemas and other artifacts into a dummy project; when a schema changes, transformation authors sync that project and copy the schemas from the dummy project to each of the other projects. This copy mechanism can be added to synchronization scripts. In these cases, the local copy of the shared schema should not be placed under source control.
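As a sketch of such a synchronization script (the dummy-project and target-project paths are hypothetical, and only .xsd files in the root of the shared project are considered), the schema copy can be automated along these lines:

```python
import shutil
from pathlib import Path

def sync_shared_schemas(shared_project, target_projects):
    """Copy every .xsd from a shared 'dummy' project directory into
    each consuming project directory, so that all projects see the
    same schema revision after a source-control update."""
    copied = []
    for xsd in Path(shared_project).glob("*.xsd"):
        for target in target_projects:
            shutil.copy2(xsd, Path(target) / xsd.name)  # overwrite local copy
            copied.append((xsd.name, str(target)))
    return copied
```

A script like this would typically run right after syncing the dummy project from source control, before reopening the consuming projects in B2BDT Studio.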

Using Multiple Workspaces


A typical B2BDT solution may be comprised of multiple B2BDT transformations, shared components, schemas, and other artifacts. These transformations and other components may all be part of the same logical aspect of a B2B Data Transformation solution, or may form separate logical aspects of a solution. Some B2B Data Transformation solutions will result in the production of hundreds of transformation services, parsers and B2BDT components. When the B2BDT Studio environment is launched, it attempts to load all transformation projects into memory. While B2BDT Studio allows closing a project to conserve memory and system resources (by right-clicking on the project and selecting the close option), large numbers of B2BDT projects can make use of the Studio environment unwieldy.

Closing a Project Within B2BDT Studio

Reopen Project Using Open Option


There may also be a need (due to complexity, security or other reasons) to separate work between different developers, so that only some projects need to be opened within a given developer's workspace. For these reasons (the number of transformations, separation of logical aspects of the solution, enforcement of change control), it may be appropriate to use separate workspaces to partition the projects.

Staging Development Environments


When there are multiple developers on a project, and/or large numbers of transformations, it is recommended to have a staging development environment where all transformations are assembled prior to deployment to test environments. While it is possible to have each developer transfer their work to the staging development environment directly, it is recommended that the staging development environment is synchronized from the source code control system. This enforces good check-in practices, as only those transformations that are checked in will be propagated to the staging development environment.

It is also possible to require that each developer publishes their working services to a local service DB on their machine and uses source code control to check in the published services. If this approach is chosen, it should be considered in addition to, not instead of, using source code control to manage work in progress. In Agile development methodologies, one of the core concepts is always having a working build available at any time. By using source code control to manage working copies of deployed services, it is possible to enforce this concept.

When the target platform is a non-Windows platform, it is also necessary to consider where the version of the services for non-Windows platforms should be assembled. For example, you can assemble the non-Windows version of the B2BDT solution on the staging development machine and either transfer the transformation services to the QA environment manually or use additional check-in/check-out procedures to perform the transfer.
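One way to script the staging assembly is sketched below. It is illustrative only: it assumes the source-control checkout contains one sub-directory per published service and that version-control bookkeeping directories begin with a dot, and it simply refreshes a staging service DB from that checkout:

```python
import shutil
from pathlib import Path

def assemble_staging(checkout_dir, staging_service_db):
    """Refresh a staging service DB from a source-control checkout.
    Every sub-directory of checkout_dir is treated as one published
    service; each replaces its staging counterpart."""
    staging = Path(staging_service_db)
    staging.mkdir(parents=True, exist_ok=True)
    deployed = []
    for svc in sorted(Path(checkout_dir).iterdir()):
        if svc.is_dir() and not svc.name.startswith("."):  # skip .svn/.git
            dst = staging / svc.name
            if dst.exists():
                shutil.rmtree(dst)      # drop the stale copy
            shutil.copytree(svc, dst)
            deployed.append(svc.name)
    return deployed
```

Because only checked-in services appear in the checkout, a script like this enforces the check-in discipline described above.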

Synchronization of Changes from Source Code Control System


If a synchronization operation in a source code control system adds an additional project to the workspace, it is necessary to use the File | Import command in the B2BDT Studio environment to import the project into the workspace. If a change occurs to a schema while the Studio environment is open, it is sometimes necessary to switch to the schema view in the Studio environment to detect the schema change.

Best Practices for Multi-Developer or Large Transformation Solutions


- DO install a separate instance of the B2BDT Studio environment on each author's machine.
- DO use a source code control system to synchronize and share work in progress.
- DO consider using a dummy project to share common metadata and artifacts such as XML schemas.
- DON'T rely on Remote Desktop Connection to share simultaneous usage of B2BDT Studio for the same workspace.
- DO use a separate workspace location for each user on the same machine.
- DO place shared components under version control.
- DO define scripts to aid with synchronization of changes to shared resources such as schemas.
- DO consider use of a staging development environment for projects with a large number of transformations, multiple transformation authors or non-Windows target platforms.
- DO consider having an identical folder structure if each developer has a dedicated machine.

Configuring the B2BDT Environment


B2BDT supports configuration settings through a number of means. These include the B2BDT Configuration application (which modifies the CMConfig.xml configuration file), the setting of global properties in the B2BDT Studio configuration environment, the setting of project-specific properties on a B2BDT project, and the use of platform environment variables.

The B2BDT Configuration application allows setting of global B2BDT properties through a GUI-based application. Changing property settings through the Configuration application causes changes to be made to the CMConfig.xml file (once saved).

B2BDT Configuration Application

The Configuration application allows setting of properties such as JVM parameters, thread pool settings, memory available to the Studio environment and many other settings. Consult the administrator's guide for a full list of settings and their effects. Properties set using the B2BDT Configuration application affect both the operation of the standalone B2BDT runtime environment and the behavior of the B2BDT Studio environment.

Within the B2BDT Studio environment, properties may be changed for the Studio environment as a whole, and on a project-specific basis.

B2BDT Studio Preferences

The B2BDT Studio preferences allow customization of properties that affect all B2BDT projects, such as which events are generated for troubleshooting, logging settings, auto-save settings and other Studio settings.

B2BDT Project Properties


Project properties may be set in the B2BDT Studio environment for a specific B2BDT project. These include settings such as the encoding being used, namespaces used for XML schemas, control over XML generation, control over the output from a project and other project-specific settings. Finally, OS environment variables are used to set aspects of the system such as the Java classpath to be used, the location of the configuration file for a specific user, the home location of the installed B2BDT instance, library paths, etc.

The following table lists some typical configuration items and where they are set:

Type of configuration item                              Where configured
Memory for Studio                                       B2BDT Configuration application
JVM / JRE usage                                         B2BDT Configuration application
Tuning parameters (threads, timeouts, etc.)             B2BDT Configuration application
User-specific settings                                  Use an environment variable to point to a different configuration file
Memory for runtime                                      B2BDT Configuration application
Transformation encoding, output, event settings         Project properties
Workspace location                                      B2BDT Configuration application (B2BDT 4.3, formerly PowerExchange for Complex Data); B2BDT Studio (B2BDT 4.4)
Event generation                                        Set in project properties
Repository location                                     B2BDT Configuration application

Development Configuration Settings


The following settings need to be set up for correct operation of the development environment:

- Java home directory: set using the CMConfiguration | General | Java setting in the Configuration editor
- Java maximum heap size: set using the CMConfiguration | General | Java setting in the Configuration editor
- Repository location: needed to deploy projects from within Studio
- JVM path for use with the Studio environment
- B2BDT Studio Eclipse command line parameters: used to set the memory available to the Studio environment. Use -Xmx<nn>m to set the maximum allocation pool to nn MB, and -Xms<nn>m to set the initial allocation pool to nn MB.
- Control over project output: by default, automatic output is enabled. This needs to be switched off for most production-quality transformations.
- Use of event files: disable for production
- Use of working encoding

For most development scenarios, a minimum of 2 GB of memory is recommended for authoring environments.

Development Security Considerations


The user under which the Studio environment runs needs write access to the directories where logging occurs and where event files are placed, read and write access to the workspace locations, and read and execute access to JVMs and any tools used in the operation of preprocessors. The B2BDT transformation author needs read and execute permissions for the B2BDT install directory and all of its subdirectories. In some circumstances, the user under which a transformation runs differs from the logged-in user. This is especially true when running under the control of an application integration platform such as BizTalk, or under a web services host environment. Note: Under IIS, the default user identity for a web service is the local ASPNET user. This can be configured in the AppPool settings, in the .NET configuration settings and in the web service configuration files.

Best Practices Workspace Organization


As B2B Data Transformation Studio loads all projects in the current workspace into the Studio environment, keeping all projects under design in a single workspace leads to both excessive memory usage and logical clutter between transformations belonging to different, possibly unrelated, solutions. Note: B2B Data Transformation Studio allows projects to be closed to reduce memory consumption. While this helps with memory consumption, it does not address the logical organization aspects of using separate workspaces.

TIP: Right-click the B2BDT project node in the explorer view to open or close a B2BDT project in the workspace. Closing a project reduces the memory requirements of the Studio environment.

Separate Workspaces for Separate Solutions


For distinct logical solutions, it is recommended to use separate workspaces to organize the B2BDT projects relating to each solution. The B2B Data Transformation Studio configuration editor may be used to set the current workspace.

Separate Transformation Projects for Each Distinct Service


From a logical organization perspective, it is easier to manage Complex Data solutions if only one primary service is published from each project. Secondary services from the same project should be reserved for the publication of test or troubleshooting variations of the same primary service. The one exception is where multiple services are substantially the same, with the same transformation code but minor differences in inputs. One alternative to publishing multiple services from the same project is to publish a shared service which is then called by the other services in order to perform the common transformation routines. For ease of maintenance, it is often desirable to name the project after the primary service which it publishes. While these do not have to be the same, it is a useful convention and simplifies the management of projects.

Implementing B2BDT Transformations



There are a number of considerations to be taken into account when looking at the actual implementation of B2BDT transformation services:

- Naming standards for B2BDT components
- Determining the need and planning for data splitting
- How B2BDT will be invoked at runtime
- Patterns of data input and output
- Error handling strategies
- Initial deployment of B2BDT transformations
- Testing of B2BDT transformations

Naming Standards for Development


While naming standards for B2BDT development are the subject of a separate Best Practice, the key points can be summarized as follows:

- B2BDT service names must be unique.
- B2BDT project names must be unique.
- Avoid file system names for B2BDT artifacts. For example, do not use names such as CON: as they may conflict with file system names.
- Avoid names inconsistent with programming models. Consider that the B2BDT service name or service parameter may need to be passed as a web service parameter, or may drive the naming of an identifier in Java, C, C# or C++.
- Avoid names that are invalid as command line parameters. As authors may need to use command line tools to test the service, use names that may be passed as unadorned command line arguments. Don't use spaces, ">", etc.
- Only expose one key B2BDT service per project. Only expose additional services for debug and troubleshooting purposes.
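The naming rules above can be captured in a quick sanity-check script. The sketch below is illustrative only, not an official B2BDT validation; the reserved-name set and identifier pattern are assumptions based on the guidelines listed here.

```python
import re

# Windows reserved device names that can conflict with file system names (e.g. CON)
RESERVED = {"CON", "PRN", "AUX", "NUL"} \
    | {f"COM{i}" for i in range(1, 10)} \
    | {f"LPT{i}" for i in range(1, 10)}

def is_valid_service_name(name: str) -> bool:
    """Heuristic check of a candidate B2BDT service or project name against
    the guidelines above: not a reserved file system name, no spaces or
    shell metacharacters, and usable as a Java/C/C# identifier."""
    if name.upper().rstrip(":") in RESERVED:
        return False
    # Safe both as an unadorned command line argument and as an identifier
    return re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name) is not None
```

Running such a check as part of a pre-publication script catches names like "CON:" or names containing spaces before they cause trouble downstream.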

Data Splitting
There are a number of factors influencing whether source data should be split, how it can be split, or indeed whether a splitting strategy is necessary at all. First, consider when data may need to be split. For many systems, the fundamental characteristic to consider is the size of the inbound data. For many EAI platforms, files or blobs in excess of 10 MB can pose problems. For example, PowerCenter, Process Server and BizTalk impose limits on how much XML can be processed. This depends on what operations are needed on the XML files (do they need to be parsed, or are they just passed as files), the version of the platform software (64-bit vs. 32-bit) and other factors. A midrange B2BDT system can typically handle hundreds of MB of data where the same system may only handle 10 MB on other platforms. But there are additional considerations to take into account:

- Converting flat file or binary data can result in XML five times the original size.
- Excel files larger than 10 MB can result in very large XML files, depending on the choice of document processor in B2BDT.
- B2BDT generates very large event files for file sources such as Excel files.

In general, files of less than 10 MB in size can be processed in B2BDT without splitting.

With the 64-bit version of B2BDT, a much greater volume of data can be handled without splitting. For example, an existing solution at one customer handles 1.6 GB of XML input data on a dual-processor machine with 16 GB of RAM (using x86-based 64-bit RHEL), with an average processing time of 20 minutes per file. 32-bit Windows environments are often limited to 3 GB of memory (2 GB available to applications), which can limit what may be processed. In development environments, much less memory is available to process a file (especially when event generation is turned on). It is common practice to use much smaller files as data samples when operating in the Studio environment, especially for files that require large amounts of memory to preprocess. For Excel files, sample files of 2 MB or less are recommended, depending on file contents.

B2BDT provides a built-in streaming mechanism which supports splitting of files (although it does not support splitting of Excel files in the current release). Considerations for splitting using the streaming capabilities include:

- Is there a natural boundary to split on? For example, EDI functional groups, transactions and other constructs can provide a natural splitting boundary. Batch files composed of multiple distinct files also provide natural splitting boundaries.

In general, a file cannot be split if a custom document preprocessor is required to process it. In some cases, disabling the event generation mechanism will alleviate the need for splitting.

How Will B2BDT Be Invoked at Run Time?


B2BDT supports a variety of mechanisms for invocation:

Command line: Command line tools are intended mainly for troubleshooting and testing. Use of command line tools does not span multiple CPU cores for transformations and always generates the event file in the current directory.

HTTP (via CGI): Supports exposing a B2BDT transformation via a web server.

Web services: B2BDT services may be hosted in a J2EE-based web service environment. Service assets in progress will support hosting of B2BDT services as IIS-based web services.

APIs (C++, C, Java, .NET): Offer great flexibility. The calling program needs to organize parallel calls to B2BDT to optimize throughput.

EAI agents: Agents exist for BizTalk, webMethods and many other platforms.

PowerCenter: Through use of UDO, B2BDT services may be included as a transformation within a PowerCenter workflow.

In addition, B2BDT supports two modes of activation: in-process and server operation.

In-process:
- The B2BDT call runs in the process space of the caller.
- Can result in excessive initialization costs, as each call may incur initialization overhead, especially with a custom code client.
- A fault in the B2BDT service may result in a failure in the caller.
- In measurements for a custom BizTalk-based system (not via the standard agent), the initial call took 3 seconds and subsequent calls 0.1 seconds; if the process was not kept alive, the initial 3-second hit was incurred multiple times.
- Not supported for some APIs.

Server:
- A B2BDT service call results in a call into another process.
- Slower overall communication, but can avoid the initial startup overhead, as the process possibly remains alive between invocations.
- In practice, web service invocation is sped up by use of server invocation.
- No effect for Studio or command line invocation.
- Can allow a 64-bit process to activate a 32-bit B2BDT runtime, or vice versa.

Patterns of Data Input and Output


There are a number of patterns of inputs and outputs used commonly in B2BDT transformations:

Direct data: The data to be transformed is passed directly to the transformation and the output data is returned directly. Under this mechanism the output data format needs to allow for returning errors, or errors need to be returned through well-known error file locations or some other pre-agreed mechanism.

Indirect via file: The transformation receives a string that designates a file to process, and the transformation reads the real data from that file. A slightly more complex version of this may include passing of input, output and error file paths as semicolon-delimited strings or some similar mechanism.

Indirect via digest or envelope file: The data passed to the transformation specifies a wide range of parameters as a single file, in a similar manner to a SOAP envelope. This digest file could contain many input file paths, output file paths, parameters to services, error handling arguments, performance characteristics, etc. The processing of the digest file becomes much more complex, but it is essential when many input files must be processed, as it avoids much of the overhead of the host system having to load the data files into memory. However, transaction semantics offered by host systems cannot be utilized in these scenarios. This also offers a great means for implementing custom error handling strategies.
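As an illustration of the digest pattern, a host process might generate a small envelope document listing inputs, outputs and error locations for one batch run. The element names in the sketch below are hypothetical, not a B2BDT-defined schema.

```python
import xml.etree.ElementTree as ET

def build_digest(input_paths, output_path, error_path):
    """Build a hypothetical digest/envelope document listing the input
    files to process plus output and error file locations for one batch.
    The element names are illustrative, not a B2BDT-defined schema."""
    root = ET.Element("TransformationBatch")
    for path in input_paths:
        ET.SubElement(root, "InputFile").text = path
    ET.SubElement(root, "OutputFile").text = output_path
    ET.SubElement(root, "ErrorFile").text = error_path
    return ET.tostring(root, encoding="unicode")
```

The transformation would then parse this single small document and pull each listed input file itself, rather than having the host load every data file into memory.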

Error Handling Strategies


B2BDT offers the following error handling features:

B2BDT event log: This is a B2BDT-specific event generation mechanism where each event corresponds to an action taken by a transformation, such as recognizing a particular lexical sequence. It is useful in troubleshooting work in progress, but event files can grow very large, hence it is not recommended for production systems. It is distinct from the event systems offered by other B2BDT products and from the OS-based event system. Custom events can be generated within transformation scripts. Event-based failures are reported as exceptions or other errors in the calling environment.

B2BDT trace files: Trace files are controlled by the B2BDT Configuration application. Automated strategies may be applied for recycling of trace files.

Custom error information: At the simplest level, custom errors can be generated as B2BDT events (using the AddEventAction). However, if the event mechanism is disabled for memory or performance reasons, these are omitted. Other alternatives include generation of custom error files, integration with OS event tracking mechanisms and integration with 3rd-party management platform software. Integration with OS eventing or 3rd-party platform software requires custom extensions to B2BDT.

Overall, the B2BDT event mechanism is the simplest to implement. But for large or high-volume production systems, the event mechanism can create very large event files, and it offers no integration with popular enterprise software administration platforms. It is recommended that B2BDT events are used for troubleshooting purposes during development only. In some cases, performance constraints may determine the error handling strategy. For example, updating an external event system may cause performance bottlenecks, and producing a formatted error report can be time-consuming. In some cases operator interaction may be required, which could potentially block a B2BDT transformation from completing. Finally, it is worth looking at whether some part of the error handling can be offloaded outside of B2BDT to avoid performance bottlenecks.

When using custom error schemes, consider the following:

- Multiple invocations of the same transformation may execute in parallel.
- Don't hardwire error file paths.
- Don't assume a single error output file.
- Avoid use of the B2BDT event log for production, especially when processing Excel files.
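One way to respect the first three points is to derive a unique error file path per invocation rather than hardwiring one. The sketch below is a generic helper, not a B2BDT API; the naming convention is an assumption.

```python
import os
import time

def error_file_path(error_dir, service_name):
    """Derive a unique error file path per invocation so that parallel
    runs of the same transformation never collide on a hardwired path.
    Combines a timestamp with the process id for uniqueness."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    return os.path.join(error_dir, f"{service_name}-{stamp}-{os.getpid()}.err")
```

A path like this can be passed into the transformation (for example via a digest file) so each run writes its own error output.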

Effects of API on event generation:

CM_Console: A service deployed with events will produce events. A service deployed without events will not produce events.

Java API: The service runs without events. In case of error, the service is rerun with events.

C# / .NET: Same as Java.

Agents: No events unless there is an error, irrespective of how the service was deployed. The same behavior is used for PowerCenter.

Testing
A full test of B2BDT services is covered by a separate Best Practice document. For simple cases, and as a first step in most B2BDT transformation development projects, the B2BDT development environment offers a number of features that can be used to verify the correctness of B2BDT transformations. Initial testing of many transformations can be accomplished using these features alone.

1. The B2BDT Studio environment provides visual feedback on which components of the input data are recognized by a parser. This can be viewed in the data browser window of a B2BDT project, and the Studio environment will automatically mark up the first set of occurrences of patterns matched and literals found. Through the use of a simple menu option, all recognized occurrences of matched data can be marked up within the Studio authoring environment.
2. The B2BDT Studio environment exposes a structured event log mechanism that allows developers to browse the flow of a transformation, which can be used to verify the execution of various components of a transformation.
3. The B2BDT Studio environment supports specification of additional sources on which to perform a transformation, in order to verify the transformation's execution against a set of sample or test data inputs. This is accomplished inside the Studio design environment by simply setting the sources to extract property to point to the test data, either as specific files or as a directory search for data files matching a file pattern. The unit test can also be automated using the command line API.

Results of executed transformations can be previewed in the Studio environment, along with events generated during the transformation. In many production scenarios, the B2BDT transformation is called from an overall workflow process (EAI, ETL, MSMQ, etc.), and this integrated environment is what is typically reflected in a lab environment (Dev/Test/QA).
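Automating the unit test via the command line, as mentioned above, might look like the sketch below. The CM_Console argument order shown (service name, then input file) is an assumption for illustration; consult the B2BDT documentation for the actual syntax and flags.

```python
import subprocess
from pathlib import Path

def build_command(service_name, input_file):
    """Command line for a single CM_Console run. The argument order is an
    assumption for illustration; check the B2BDT docs for real syntax."""
    return ["CM_console", service_name, str(input_file)]

def run_over_samples(service_name, sample_dir, pattern="*.txt"):
    """Invoke a deployed service once per sample file in a directory,
    stopping on the first failing invocation (non-zero exit code)."""
    for path in sorted(Path(sample_dir).glob(pattern)):
        subprocess.run(build_command(service_name, path), check=True)
```

Running such a script over a directory of representative sample files gives a repeatable smoke test outside the Studio environment.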

Deployment
Published B2BDT services are stored in a B2BDT repository, which is a designated file system location where the B2BDT runtime looks for services when requested to invoke a transformation service. This may be a shared file system location, such as a network share or a SAN-based mechanism, facilitating the sharing of services between multiple production servers.

A B2BDT project may be published within the B2BDT Studio environment to deploy a single B2BDT service to the B2BDT repository. A project can be used to deploy multiple B2BDT services by setting different options, such as the transformation entry point (the same identical service can even be deployed under multiple B2BDT service names). At the simplest level, a B2BDT transformation may be deployed through one of two options.

Direct: The transformation deployment target directory is set using the CMConfiguration editor. If the CM repository is set to a location, such as a network share, which is referenced by a production or QA environment, publishing the service makes it available directly to the QA or production environment. Note: The refresh interval B2BDT configuration setting determines how often a runtime instance checks the file system for updated services.

Indirect: The B2BDT transformation deployment target directory is set (via the CM repository configuration setting) to a developer-specific directory. This directory is subsequently copied to the QA/production environment using mechanisms outside of the B2BDT Studio environment (a simple copy, or source management environments such as CVS or SourceSafe). Staging environments may be employed where it is necessary to assemble multiple dependent services prior to deployment to a test environment. The section on source code control covers a number of strategies for deployment of services using version control. Other alternatives include the use of custom scripts or setup creation tools (such as InstallShield).

Configuration Settings Affecting Deployment: The following configuration setting affects how soon a newly deployed service is detected: the service refresh interval.

Further Considerations for Deployment: More detailed descriptions of deployment scenarios will be provided in a separate Best Practice. Some of the considerations to be taken into account include:
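An indirect deployment step, copying a developer-specific repository directory to a shared QA location, can be scripted. The sketch below uses placeholder directory layouts; it simply mirrors service folders and is not a B2BDT tool.

```python
import shutil
from pathlib import Path

def deploy_repository(dev_repo, qa_repo):
    """Copy each service folder from a developer-specific B2BDT repository
    directory into the shared QA repository location, replacing any older
    deployment of the same service. Paths are illustrative placeholders."""
    dev_repo, qa_repo = Path(dev_repo), Path(qa_repo)
    qa_repo.mkdir(parents=True, exist_ok=True)
    for service_dir in dev_repo.iterdir():
        if service_dir.is_dir():
            target = qa_repo / service_dir.name
            if target.exists():
                shutil.rmtree(target)   # replace the older deployment
            shutil.copytree(service_dir, target)
```

The QA runtime then picks up the new services on its next repository refresh interval.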
Last updated: 30-May-08 19:24


Testing B2B Data Transformation Services

Challenge


Establish a testing process that ensures support for team development of B2B Data Transformation (B2BDT) solutions, strategies for verification of scaling and performance requirements, testing for transformation correctness and overall unit and system test procedures as business and development needs evolve.

Description
When testing B2B Data Transformation services, the goal throughout the process is to test transformations for measurable correctness, performance and scalability. The testing process is broken into three main functions, which are addressed through the test variants. The scenarios addressed in this document include finding bugs/defects, testing and ensuring functional compliance with the desired specifications, and ensuring compliance with industry standards/certifications. The success of the testing process should be based on measurable milestones that provide an assessment of overall transformation completion.

Finding Defects
The first topic to address within the QA process is the ability to find defects within the transformation and to test them against specifications for compliance. This process has a number of options available; choose the method that best fulfills testing requirements based upon time and resource constraints. In the testing process, the QA cycle refers to the ability to find, fix or defer errors and retest them until the error count reaches zero (or a specified target). To ensure compliance with the defined specifications during the QA process, test basic functionality and ensure that outlying transformation cases behave as defined. For these types of tests, ensure that failure cases fail as expected, in addition to ensuring that the transformation succeeds as expected.

Ensuring Compliance
Another integral part of the testing process with B2B Data Transformations is the validation of transformations against industry standards such as HIPAA. In order to test standardized output, there needs to be validation of well-formed inputs and outputs, such as HIPAA levels 1-6, and testing against a publicly available data set. An optimally tested solution can be ensured through the use of 3rd-party verification software, through validation support in the B2B Data Transformation libraries that verifies data compliance, or through B2BDT transformations created in the course of a project specifically for test purposes.

Performance
Performance and stress testing are additional components of the testing methodology for B2BDT transformations. To effectively test performance, compare the effects of different configuration parameters on the Informatica server, based on server and machine configurations. Based on data sizes and the complexity of transformations, optimize server configurations for best- and worst-case scenarios. One way to track benchmarking results is to create a reference spreadsheet that defines the amount of time needed for each source file to process through the transformation, based upon file size.
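A minimal sketch of building such a benchmark record follows, assuming the transformation run is wrapped in a callable (here a placeholder argument, since the actual invocation mechanism varies by deployment).

```python
import csv
import time
from pathlib import Path

def benchmark(run_transformation, source_files, csv_path):
    """Time a transformation callable over each source file and append
    (file name, size in bytes, seconds) rows to a reference CSV.
    run_transformation is a placeholder for whatever mechanism invokes
    the deployed B2BDT service on a single file."""
    with open(csv_path, "a", newline="") as fh:
        writer = csv.writer(fh)
        for path in source_files:
            path = Path(path)
            start = time.perf_counter()
            run_transformation(path)
            elapsed = time.perf_counter() - start
            writer.writerow([path.name, path.stat().st_size, f"{elapsed:.3f}"])
```

The resulting CSV can be loaded into a spreadsheet to chart processing time against file size across configurations.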

Setting Measurable Milestones


In order to track the progress of testing transformations, it is best to set milestones to gauge the overall efficiency of the development and QA processes. Best practices include tracking failure rates for different builds. This builds a picture of the pass/failure rate over time, which can be used to determine expected delays and to gauge achievements in development over time.
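Tracking the pass/failure rate per build can be as simple as the helper below; the result structure is an assumed convention, not tied to any particular test tool.

```python
def failure_rate(results):
    """results: list of (test_name, passed) pairs for a single build.
    Returns the fraction of failed tests for use in trend tracking."""
    if not results:
        return 0.0
    failures = sum(1 for _, passed in results if not passed)
    return failures / len(results)
```

Recording this value for each build produces the over-time trend described above.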

Testing Practices The Basics


This section focuses on the initial testing of a B2B Data Transformation. For simple cases, and as a first step in most transformation development projects, the Studio development environment offers a number of features that can be used to verify the correctness of B2B Data transformations. The initial testing of many transformations can be accomplished using these features alone. It is useful to create small sample data files that are representative of the actual data, to ensure quick load times and responsiveness.

1. The B2B Data Transformation Studio environment provides visual feedback on which components of the input data are recognized by a parser. This can be viewed in the data browser window of a B2B Data Transformation project. The Studio environment will automatically mark up the first set of occurrences of patterns matched and literals found. Through the use of the mark all menu option or button, all recognized occurrences of matched data can be marked up within the Studio authoring environment. This provides a quick verification of correct operation. As shown in the figure below, the color coding indicates which data was matched.

2. The Studio environment exposes a structured event log mechanism that allows developers to browse the flow of a transformation, which can then be used to verify the execution of various components of a transformation. Reviewing the event log after running the transformation often provides an indication of the error.


3. Viewing the results file provides a quick indication of which data was matched. By default, it contains parsed XML data. Through the use of DumpValues and WriteValue statements in the transformation, the contents of the results file can be customized.

4. The Studio environment supports the specification of additional sources on which to perform a transformation, in order to verify the transformation's execution against a set of sample or test data inputs. This is accomplished inside the Studio design environment by simply setting the sources to extract property to point to the test data, either as specific files or as a directory search for data files matching a file pattern. The unit test can also be automated using the command line API. Results of executed transformations can be previewed in the Studio environment, along with events generated during the transformation.

When running through the initial test process, the Studio environment provides a basic indication of the overall integrity of the transformation. These tests allow for simple functional checks of whether the transformation failed and whether the correct output was produced. The events navigation pane provides a visual description of transformation processing. An illustration of the events view log within Studio is shown below.


In the navigation pane, blue flags depict warnings that can be tested against functional requirements, whereas red flags indicate fatal errors. Event logs are available when running a transformation from the Studio environment. Once a service has been deployed (with event output turned on), event logs are written to the directory from which CM_Console is run (when testing a service with CM_Console). When invoking a service through other mechanisms, the following rules apply for event log generation.

Effects of API on Event Generation


CM_Console: A service deployed with events enabled will produce events. A service deployed without events enabled will not produce events.

Java API: The service runs without events. In case of error, the service is rerun automatically with events enabled.

C# / .NET: Same as Java.

Agents: No events unless there is an error, irrespective of how the service was deployed. The same behavior is used for PowerCenter.


To view the error logs, use the Studio event pane to scan through the log for specific events. To view an external event file (usually named events.cme), drag and drop the file from the Windows Explorer into the B2BDT Studio events pane. It is also possible to create a B2BDT transformation to look for specific information within the event file.

Other Troubleshooting Output


B2B Data Transformation services can be configured to produce trace files that can be examined for troubleshooting purposes. Trace file generation is controlled by the B2BDT Configuration application, and automated strategies may be applied for the recycling of trace files. For other forms of troubleshooting output, the following options are available:

- Simple (non-dynamic) custom errors can be generated as B2BDT events (using the AddEventAction). However, if the event mechanism is disabled for memory or performance reasons, these are omitted.
- A transformation could be used to keep track of errors in the implementation of the transformation and output these to a custom error file.
- Through the use of the external code integration APIs for Java and COM (and .NET), integration with OS event tracking mechanisms and with 3rd-party management platform software is possible through custom actions and custom data transformations.

Other Test Methods


Additional checks that can be performed include comparison with well-known input and expected output, the use of validation tools and transformations, and the use of reverse transformations and spot checks to verify expected data subsets. The sections below describe how each of these testing options works, along with their strengths and weaknesses for the QA process.

Comparing Inputs and Outputs


For many transformations, comparing the output from known good input data with expected output generated through other means provides a valuable mechanism for testing the correctness of a transformation. However, this process requires that adequate sample input data is available, as well as examples of output data for these inputs. While in some cases a simple binary comparison between the generated output and the correct output is sufficient, it may be necessary to use 3rd-party tools to perform the comparison where the output is XML or where the order of output can vary. Another test that is valid for some transformations is to check whether the output data contains a subset of the expected data. This is useful if only part of the expected output is known. Comparison techniques may need to ignore time and date stamp data in files, unless these are expected to be the same in the output. If no comparison tools are available due to the complexity of the data, it is also possible to create a B2BDT service that performs the comparison and writes the results to the results file or to a specific output file.
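A timestamp-insensitive comparison of the kind described above can be sketched as follows. The timestamp pattern shown is only an example and would need to match the date/time formats actually present in the output.

```python
import re

# Example pattern for ISO-style timestamps (e.g. 2008-05-30 19:24:00);
# adjust to whatever date/time formats appear in the real output.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}")

def outputs_match(generated: str, baseline: str) -> bool:
    """Compare generated output against a baseline output, treating any
    timestamp occurrences as equal so date/time stamps never cause
    spurious mismatches."""
    normalize = lambda text: TIMESTAMP.sub("<TIMESTAMP>", text)
    return normalize(generated) == normalize(baseline)
```

The same normalize-then-compare idea extends to other volatile fields such as run ids or host names.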

INFORMATICA CONFIDENTIAL

BEST PRACTICES

90 of 954

If no sample output data is available, one solution is to run known-good data through the transformation and save the results as a set of baseline outputs. Verify these for correctness, either by manual examination or another method. The baseline outputs can then be used for comparison and as the starting point for further variations of expected output. While this does not verify the correctness of the initial execution of the data transformation, the saved baseline output can be used to verify that expected behavior has not been broken by maintenance changes. Tools for comparing inputs and outputs include 3rd party applications such as KDiff3, an open-source comparison tool that works well for both XML and text files (see an example in the figure below).

Validation Transformations
For some types of data, validation software is available commercially or may already exist in the organization. In the absence of commercial or in-house validation software, Informatica recommends creating B2BDT services that validate the data. The developers who create the validation transformations should be different from those who created the original transformations, and a strict no-code-sharing rule should be enforced so that the validation is not simply a copy of the transformation.

Reverse Transformations
Another option is to use a reverse transformation: a transformation that converts the output back into the input data, which can then be used as the basis for the comparison techniques described above. Running the output of a B2B Data Transformation through an independently created reverse transformation is optimal; an auto-generated reverse transformation has a tendency to propagate the same bugs. Partial or full comparisons of the input against the output of the reverse transformation can then be performed. While this allows testing of functional compliance, the downside is the high time cost of fully implementing an independent reverse transformation.

Spot Checking
In some cases it may not be feasible to perform a full comparison test on outputs. Creating a set of spot-check transformations provides some measure of quality assurance. The basic concept is that one or more transformations are created that perform spot checks on the output data using B2BDT services. As new issues arise in QA, enhance the spot checks to detect the new problems and to look for common mistakes in the output; over time this builds up a library of checks. Programmatic checks can also be embedded within the transformation itself, for example by inserting actions that self-test output using the AddEventAction feature. If the B2B Data Transformation service is called through an API, exceptions in the calling code can be checked as well; this is a subset of spot checking that can assist the testing process. An error-tracking layer can also be applied to the XML output so that, through programmatic checks, all errors associated with the transformation are written to the output XML. The figure below illustrates how to embed programmatic checks within the transformation.

In the example above, flags are set and error codes are assigned to the specific XML error fields that were defined earlier in the XML Schema definition. If the ensure condition fails, the error flags are set and reported to the output XML stream.
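The spot-check idea can also be applied from outside the transformation. The sketch below scans an output document for error flags and collects the reported error codes; the element and attribute names are invented for illustration, so substitute the fields defined in your own XML schema:

```python
import xml.etree.ElementTree as ET

def collect_errors(xml_text):
    """Return (record_count, error_codes) found in a transformation's XML output.

    Assumes each record may carry an errorFlag attribute and an errorCode
    child element, as defined in the project's own schema.
    """
    root = ET.fromstring(xml_text)
    codes = []
    records = root.findall(".//record")
    for rec in records:
        if rec.get("errorFlag") == "true":
            codes.append(rec.findtext("errorCode", default="UNKNOWN"))
    return len(records), codes
```

A script like this can run after each transformation execution and fail the QA step whenever any error code appears.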

Unit Testing
The goal of unit testing is to catch basic defects before a traditional QA cycle, reducing cost in time and effort. Unit tests are small sets of tests run by the developer of a transformation before signing off on code changes. They should be created and maintained by the developer and used for both regression control and functionality testing, and they are often paired with a test-first development methodology. Unit tests are not a replacement for a full QA process, but they let developers quickly verify that functionality has not been broken by changes. They may be programmatic or manual, although implementing them as a programmatic suite makes it practical to run the test cases after every change.
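A unit-test suite for a transformation can be as simple as a table of input/expected-output pairs run through whatever invokes the service. The invocation below is a placeholder, since the real call (command line, Java API, or a PowerCenter session) varies by project:

```python
import unittest

def run_transformation(input_text):
    """Placeholder for invoking the B2BDT service on input_text.

    Replace with the project's real invocation (CLI call, API, etc.).
    This stand-in simply upper-cases the input so the suite is runnable."""
    return input_text.upper()

class TransformationRegressionTests(unittest.TestCase):
    # Known-good input paired with verified baseline output.
    CASES = [
        ("abc", "ABC"),
        ("already UPPER", "ALREADY UPPER"),
    ]

    def test_baseline_cases(self):
        for source, expected in self.CASES:
            self.assertEqual(run_transformation(source), expected)
```

Running such a suite (e.g., with python -m unittest) after every change gives the fast regression check the unit-testing practice calls for.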

Testing Transformations Integrated with PowerCenter


When testing B2B Data Transformations used with PowerCenter, test the transformation with the aforementioned processes before using it within a mapping. Using B2B Data Transformations with PowerCenter does have an advantage: during debugging, the data can be visualized as it leaves each transformation in the mapping. When combining PowerCenter with B2B Data Transformations, write the output to a flat file to allow quick spot-check testing.

Design Practices to Facilitate Testing

Use of Indirect Pattern for Parameters


One way to facilitate testing of B2B Data Transformations is the indirect pattern for parameters, which is similar to referencing the source input in a parameter file. The input to the transformation service is set to a request file at a specified host location. The request file indicates where to read the input, where to place the output, and where to report the status of executing transformations. It can be an XML file managed by the local administrator, and this method can reduce the host environment footprint. Staging areas for inputs and outputs can be created, providing an easy way to track completed transformations. During mapping, the request file is processed to determine the actual data to be mapped along with the target locations; once these have been read, control is passed to the transformation, which performs the actual mapping. The figures below demonstrate this strategy.


In the mapper illustrated above, the main service input and output take the form of references (provided as individual service parameters or combined into a single XML block) that point to the real input and output data: paths to specific files, or a path to an accessible directory containing a collection of files. Alternatively, a collection of files may be referred to by a sequence of individual paths, although that approach limits the parallel operation of some transformations.
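The request-file pattern can be modeled in a few lines. The XML layout below is purely illustrative (the real request schema is whatever the local administrator defines), but it shows how a single small input carries the locations of the actual data:

```python
import xml.etree.ElementTree as ET

# Illustrative request file: it points at the real input and output
# locations instead of carrying the data itself.
REQUEST = """<request>
  <input path="/staging/in/orders_20080530.edi"/>
  <output path="/staging/out/orders_20080530.xml"/>
  <status path="/staging/status/orders_20080530.log"/>
</request>"""

def parse_request(xml_text):
    """Extract the input, output, and status locations from a request file."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.get("path") for child in root}
```

The mapping first parses the request to resolve these paths, then hands control to the transformation that does the real work.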
Last updated: 30-May-08 23:55


Configuring Security

Challenge


Configuring a PowerCenter security scheme to prevent unauthorized access to folders, sources and targets, design objects, run-time objects, global objects, security administration, domain administration, tools access, and data in order to ensure system integrity and data confidentiality.

Description
Security is an often-overlooked area within the Informatica ETL domain, yet ignoring domain security means ignoring a crucial component of ETL code management. Determining an optimal security configuration for a PowerCenter environment requires a thorough understanding of business requirements, data content, and end-user access requirements. Knowledge of PowerCenter's security functionality and facilities is also a prerequisite to security design.

Implement security with the goals of easy maintenance and scalability, and keep it simple: although PowerCenter includes the utilities for a complex web of security, the simpler the configuration, the easier it is to maintain. Securing the PowerCenter environment involves the following basic principles:
- Create users and groups
- Define access requirements
- Grant privileges, roles, and permissions

Before implementing security measures, ask and answer the following questions:

- Who will administer the domain?
- How many projects need to be administered? Will the administrator manage security for all PowerCenter projects or just a select few?
- How many environments will be supported in the domain?
- Who needs access to the domain objects (e.g., repository service, reporting service, etc.)? What do they need the ability to do?
- How will the metadata be organized in the repository? How many folders will be required?
- Where can repository service privileges be limited by granting folder permissions instead?
- Who will need Administrator or Super User-type access?

After you evaluate the needs of the users, you can create appropriate user groups and assign repository service privileges and folder permissions. In most implementations, the administrator takes care of maintaining the repository. Limit the number of administrator accounts for PowerCenter. While this concept is important in a development/unit test environment, it is critical for protecting the production environment.

Domain Repository Overview


All of the PowerCenter Advanced Edition applications are centrally administered through the administration console and the settings are stored in the domain repository. User and group information, permissions and role definitions for domain objects are managed through the administration console and are stored in the domain repository.


Although privileges and roles are assigned to users and groups centrally from the administration console, they are also stored in each application repository. The domain periodically synchronizes this information (when an assignment is made) to each application repository. Individual application object permissions are also managed and stored within each application repository.

PowerCenter Repository Security Overview


A security system needs to properly control access to all sources, targets, mappings, reusable transformations, tasks, and workflows in both the test and production repositories. A successful security model needs to support all groups in the project lifecycle and also consider the repository structure. Informatica offers multiple layers of security, which enables you to customize the security within your data warehouse environment.

Metadata-level security controls access to PowerCenter repositories, which contain objects grouped by folders. Access to metadata is determined by the privileges granted to the user or to a group of users and the access permissions granted on each folder. Some privileges do not apply by folder, as they are granted by privilege alone (i.e., repository-level tasks).

Just beyond PowerCenter authentication is the connection to the repository database. All client connectivity to the repository is handled by the PowerCenter Repository Service over a TCP/IP connection. The database account and password are specified at installation and during configuration of the Repository Service. Developers need not know this database account and password; they should use only their individual repository user ids and passwords. This information should be restricted to the administrator.

Other forms of security available in PowerCenter include permissions for connections. Connections include database, FTP, and external loader connections. These permissions are useful for limiting access to schemas in a relational database and are set up in the Workflow Manager when source and target connections are defined.

Occasionally, you may want to restrict changes to source and target definitions in the repository. A common approach is to use shared folders owned by an Administrator or Super User. Granting read access to developers on these folders allows them to create read-only copies in their work folders.

PowerCenter Security Architecture


The following diagram, Informatica PowerCenter Security, depicts PowerCenter security, including access to the repository, Repository Service, Integration Service and the command-line utilities pmrep and pmcmd. As shown in the diagram, the repository service is the central component for repository metadata security. It sits between the PowerCenter repository and all client applications, including GUI tools, command line tools, and the Integration Service. Each application must be authenticated against metadata stored in several tables within the repository. Each Repository Service manages a single repository database where all security data is stored as part of its metadata; this is a second layer of security. Only the Repository Service has access to this database; it authenticates all client applications against this metadata.


Repository Service Security


Connection to the PowerCenter repository database is one level of security. The Repository Service uses native drivers to communicate with the repository database. PowerCenter Client tools and the Integration Service communicate with the Repository Service over TCP/IP. When a client application connects to the repository, it connects directly to the Repository Service process. You can configure a Repository Service to run on multiple machines, or nodes, in the domain. Each instance running on a node is called a Repository Service process. This process accesses the database tables and performs most repository-related tasks. When the Repository Service is installed, the database connection information is entered for the metadata repository. At this time you need to know the database user id and password to access the metadata repository. The database user id must be able to read and write to all tables in the database. As a developer creates, modifies, executes mappings and sessions, this information is continuously updating the metadata in the repository. Actual database security should be controlled by the DBA responsible for that database, in conjunction with the PowerCenter Repository Administrator. After the Repository Service is installed and started, all subsequent client connectivity is automatic. The database id and password are transparent at this point.

Integration Service Security


Like the Repository Service, the Integration Service communicates with the metadata repository when it executes workflows or when users are using the Workflow Monitor. During configuration of the Integration Service, the repository database is identified with the appropriate user id and password. Connectivity to the repository is made using native drivers supplied by Informatica. Certain permissions are also required to use the pmrep and pmcmd command line utilities.

Encrypting Repository Passwords


You can encrypt passwords and create an environment variable to use with pmcmd and pmrep. For example, you can encrypt the repository and database passwords for pmrep to maintain security when using pmrep in scripts, and store the encrypted password in an environment variable. Use the following steps as a guideline:

1. Use the command line program pmpasswd to encrypt the repository password.
2. Configure the password environment variable to the encrypted value.

To configure a password as an environment variable on UNIX:

1. At the command line, type:

   pmpasswd <repository password>

   pmpasswd returns the encrypted password.

2. In a UNIX C shell environment, type:

   setenv <Password_Environment_Variable> <encrypted password>

   In a UNIX Bourne shell environment, type:

   <Password_Environment_Variable>=<encrypted password>
   export <Password_Environment_Variable>

   You can assign the environment variable any valid UNIX name.

To configure a password as an environment variable on Windows:

1. At the command line, type:

   pmpasswd <repository password>

   pmpasswd returns the encrypted password.

2. Enter the password environment variable in the Variable field. Enter the encrypted password in the Value field.
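A script that wraps pmcmd or pmrep can then read the encrypted value from the environment rather than embedding it. The variable name below is an arbitrary example, not an Informatica convention:

```python
import os

def get_encrypted_password(var_name="PMREP_PASSWD"):
    """Read the encrypted repository password from the environment.

    The value stored here is the pmpasswd output, never the clear-text
    password; fail fast if the variable was not exported."""
    value = os.environ.get(var_name)
    if not value:
        raise RuntimeError("environment variable %s is not set" % var_name)
    return value
```

Failing fast when the variable is missing avoids passing an empty password on the command line, where it would produce a less obvious connection error.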

Setting the Repository User Name


For pmcmd and pmrep, you can create an environment variable to store the repository user name.

To configure a user name as an environment variable on UNIX:

1. In a UNIX C shell environment, type:

   setenv <User_Name_Environment_Variable> <user name>

2. In a UNIX Bourne shell environment, type:

   <User_Name_Environment_Variable>=<user name>
   export <User_Name_Environment_Variable>

   You can assign the environment variable any valid UNIX name.

To configure a user name as an environment variable on Windows:

1. Enter the user name environment variable in the Variable field.
2. Enter the repository user name in the Value field.

Connection Object Permissions


Within Workflow Manager, you can grant read, write, and execute permissions to groups and/or users for all types of connection objects. This controls who can create, view, change, and execute workflow tasks that use those specific connections, providing another level of security for these global repository objects. Users with Use Workflow Manager permission can create and modify connection objects. Connection objects allow the PowerCenter server to read and write to source and target databases. Any database the server can access requires a connection definition. As shown below, connection information is stored in the repository. Users executing workflows need execution permission on all connections used by the workflow. The PowerCenter server looks up the connection information in the repository, and verifies permission for the required action. If permissions are properly granted, the server reads and writes to the defined databases, as specified by the workflow.


Users
Users are the fundamental objects of security in a PowerCenter environment. Each individual logging into the PowerCenter domain or its services should have a unique user account. Informatica does not recommend creating shared accounts; unique accounts should be created for each user. Each domain user needs a user name and password, provided by the Informatica Administrator, to access the domain. Users are created and managed through the administration console. Users should change their passwords from the default immediately after receiving the initial user id from the Administrator. When you create a PowerCenter repository, the repository automatically creates two default repository users within the domain:
- Administrator. The default password for Administrator is Administrator.
- Database user. The username and password used when you created the repository.

These default users are in the Administrators user group, with full privileges within the repository. They cannot be deleted from the repository, nor can their group affiliation be changed. To administer repository users, you must have one of the following privileges:

- Administer Repository
- Super User

LDAP (Lightweight Directory Access Protocol)


In addition to default domain user authentication, LDAP can be used to authenticate users. Using LDAP authentication, the domain maintains an association between the domain user and the external login name. When a user logs into the domain services, the security module authenticates the user name and password against the external directory. The domain maintains a status for each user. Users can be enabled or disabled by modifying this status. Prior to implementing LDAP, the administrator must know:
- Domain username and password
- An administrator or superuser user name and password for the domain
- An external login name and password

To configure LDAP, follow these steps:

1. Edit ldap_authen.xml and modify the following attributes:
   - NAME: the .dll that implements the authentication
   - OSTYPE: the host operating system
2. Register ldap_authen.xml in the Domain Administration Console.
3. In the Domain Administration Console, configure the authentication module.

Privileges
Eight categories of privileges have been defined. Depending on the category, each privilege controls various actions for a particular object type. The categories are:

- Folders: Create, Copy, Manage Versions
- Sources & Targets: Edit, Create and Delete, Manage Versions
- Design Objects: Edit, Create and Delete, Manage Versions
- Run-time Objects: Edit, Create and Delete, Manage Versions, Monitor, Manage Execution
- Global Objects (Queries, Labels, Connections, Deployment Groups): Create
- Security Administration: Manage, Grant Privileges and Permissions
- Domain Administration (Nodes, Grids, Services): Execute, Manage, Manage Execution
- Tools Access: Designer, Workflow Manager, Workflow Monitor, Administration Console, Repository Manager

Assigning Privileges
A user must have permissions to grant privileges and roles (as well as administration console privileges in the domain) in order to assign privileges. The user must also have permission for the service to which the privileges apply. Only a user who has permissions to the domain can assign privileges in the domain. For PowerCenter, only a user who has permissions to the repository service can assign privileges for that repository service. For Metadata Manager and Data Analyzer, only a user who has permissions to the corresponding metadata or reporting service can assign privileges in that application. Privileges are assigned per repository or application instance. For example, you can assign a user create, edit, and delete privilege for runtime and design objects in a development repository but not in the production repository.

Roles
A user needs privileges to manage users, groups, and roles (and administration console privileges in the domain) in order to define custom roles. Once roles are defined, they can be assigned to users or groups for specific services. Like privileges, roles are assigned per repository or application instance. For example, the developer role (with its associated privileges) can be assigned to a user only in the development repository, but not in the test or production repository. A user must have permissions to grant privileges and roles (as well as administration console privileges in the domain) in order to assign roles, and must also have permission for the services to which the roles are to be applied. Only a user who has permissions to the domain can assign roles in the domain. For PowerCenter, only a user who has permissions to the repository service can assign roles for that repository service. For Metadata Manager and Data Analyzer, only a user who has permissions to the corresponding metadata or reporting service can assign roles in that application.

Domain Administrator Role


The domain administrator role is essentially a super-user for not only the domain itself, but also for all of the services and applications in the domain. This role has permissions to all objects in the domain (including the domain itself) and all available privileges in the domain. As a result, it has privileges to manage users, groups, and roles, and to assign privileges and roles. Because of these privileges and permissions for all objects in the domain, this role can grant itself the administrator role on all services and thereby become the super-user for every service in the domain. The domain administrator role also has implicit privileges that include:


- Configuring a node as a gateway node
- Creating, editing, and deleting the domain
- Configuring SMTP
- Configuring service levels in the domain
- Shutting down the domain
- Receiving domain alerts
- Exporting and truncating domain logs
- Configuring restart of service processes

Audit Trails
You can track changes to Repository users, groups, privileges, and permissions by selecting the SecurityAuditTrail configuration option in the Repository Service properties in the PowerCenter Administration Console. When you enable the audit trail, the Repository Service logs security changes to the Repository Service log. The audit trail logs the following operations:
- Changing the owner, owner's group, or permissions for a folder.
- Changing the password of another user.
- Adding or removing a user.
- Adding or removing a group.
- Adding or removing users from a group.
- Changing global object permissions.
- Adding or removing user and group privileges.

Sample Security Implementation


The following steps provide an example of how to establish users, groups, permissions, and privileges in your environment. Again, the requirements of your projects and production systems should dictate how security is established.

1. Identify users and the environments they will support (e.g., Development, UAT, QA, Production, Production Support, etc.).
2. Identify the PowerCenter repositories in your environment (this may be similar to the basic groups listed in Step 1; for example, Development, UAT, QA, Production, etc.).
3. Identify which users need to exist in each repository.
4. Define the groups that will exist in each PowerCenter repository.
5. Assign users to groups.
6. Define privileges for each group.

The following table provides an example of groups and privileges that may exist in the PowerCenter repository. This example assumes one PowerCenter project with three environments co-existing in one PowerCenter repository.

GROUP NAME         | FOLDER                                                       | FOLDER PERMISSIONS   | PRIVILEGES
ADMINISTRATORS     | All                                                          | All                  | Super User (all privileges)
DEVELOPERS         | Individual development folder; integrated development folder | Read, Write, Execute | Use Designer, Browse Repository, Use Workflow Manager
DEVELOPERS         | UAT                                                          | Read                 | Use Designer, Browse Repository, Use Workflow Manager
UAT                | UAT working folder                                           | Read, Write, Execute | Use Designer, Browse Repository, Use Workflow Manager
UAT                | Production                                                   | Read                 | Use Designer, Browse Repository, Use Workflow Manager
OPERATIONS         | Production                                                   | Read, Execute        | Browse Repository, Workflow Operator
PRODUCTION SUPPORT | Production maintenance folders                               | Read, Write, Execute | Use Designer, Browse Repository, Use Workflow Manager
PRODUCTION SUPPORT | Production                                                   | Read                 | Browse Repository
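A matrix like the one above can also be expressed as data and checked programmatically, which is useful when auditing whether an account's effective permissions match the design. The group, folder, and permission names below are simplified from the example and are not an Informatica API:

```python
# (group, folder) -> set of folder permissions, simplified from the example matrix.
FOLDER_PERMISSIONS = {
    ("DEVELOPERS", "Development"): {"read", "write", "execute"},
    ("DEVELOPERS", "UAT"): {"read"},
    ("UAT", "UAT working"): {"read", "write", "execute"},
    ("OPERATIONS", "Production"): {"read", "execute"},
    ("PRODUCTION SUPPORT", "Production"): {"read"},
}

def allowed(groups, folder, action):
    """Membership is inclusive: a user may act if ANY group grants the action."""
    return any(action in FOLDER_PERMISSIONS.get((g, folder), set())
               for g in groups)
```

Keeping the matrix in a reviewable form like this makes it easy to diff the intended design against what is actually configured in the repository.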

Informatica PowerCenter Security Administration


As mentioned earlier, one individual should be identified as the Informatica Administrator. This individual is responsible for a number of tasks in the Informatica environment, including security. To summarize, here are the security-related tasks an administrator is responsible for:
- Creating user accounts.
- Defining and creating groups.
- Defining and granting permissions.
- Defining and granting privileges and roles.
- Enforcing changes in passwords.
- Controlling requests for changes in privileges.
- Creating and maintaining database, FTP, and external loader connections in conjunction with the database administrator.
- Working with the operations group to ensure tight security in the production environment.

Summary of Recommendations
When implementing your security model, keep the following recommendations in mind:
Last updated: 04-Jun-08 15:34


Data Analyzer Security

Challenge


Using Data Analyzer's sophisticated security architecture to establish a robust security system that safeguards valuable business information across a range of technologies and security models, and ensuring that Data Analyzer security provides appropriate mechanisms to support and augment the security infrastructure of a Business Intelligence environment at every level.

Description
Four main architectural layers must be completely secure: user layer, transmission layer, application layer and data layer. Users must be authenticated and authorized to access data. Data Analyzer integrates seamlessly with the following LDAP-compliant directory servers:

- SunOne/iPlanet Directory Server 4.1
- Sun Java System Directory Server 5.2
- Novell eDirectory Server 8.7
- IBM SecureWay Directory 3.2
- IBM SecureWay Directory 4.1
- IBM Tivoli Directory Server 5.2
- Microsoft Active Directory 2000
- Microsoft Active Directory 2003

In addition to the directory server, Data Analyzer supports Netegrity SiteMinder for centralizing authentication and access control for the various web applications in the organization.

Transmission Layer
The data transmission must be secure and hacker-proof. Data Analyzer supports the standard security protocol Secure Sockets Layer (SSL) to provide a secure environment.

Application Layer
Only appropriate application functionality should be provided to users with associated privileges. Data Analyzer provides three basic types of application-level security:
- Report, Folder, and Dashboard Security. Restricts access for users or groups to specific reports, folders, and/or dashboards.
- Column-level Security. Restricts users and groups to particular metric and attribute columns.
- Row-level Security. Restricts users to specific attribute values within an attribute column of a table.

Components for Managing Application Layer Security


Data Analyzer users can perform a variety of tasks based on the privileges that you grant them. Data Analyzer provides the following components for managing application layer security:
- Roles. A role can consist of one or more privileges. You can use system roles or create custom roles, and you can grant roles to groups and/or individual users. When you edit a custom role, all groups and users with the role automatically inherit the change.
- Groups. A group can consist of users and/or groups. You can assign one or more roles to a group. Groups are created to organize logical sets of users and roles. After you create groups, you can assign users to them, and you can assign groups to other groups to organize privileges for related users. When you edit a group, all users and groups within the edited group inherit the change.
- Users. A user has a user name and password. Each person accessing Data Analyzer must have a unique user name. To set the tasks a user can perform, you can assign roles to the user or assign the user to a group with predefined roles.

Types of Roles
- System roles. Data Analyzer provides a set of roles when the repository is created; each role has a set of privileges assigned to it.
- Custom roles. The end user can create these roles and assign privileges to them.

Managing Groups
Groups allow you to classify users according to a particular function. You may organize users into groups based on their departments or management level. When you assign roles to a group, you grant the same privileges to all members of the group. When you change the roles assigned to a group, all users in the group inherit the changes. If a user belongs to more than one group, the user has the privileges from all groups. To organize related users into related groups, you can create group hierarchies. With hierarchical groups, each subgroup automatically receives the roles assigned to the group it belongs to. When you edit a group, all subgroups contained within it inherit the changes. For example, you may create a Lead group and assign it the Advanced Consumer role. Within the Lead group, you create a Manager group with a custom role Manage Data Analyzer. Because the Manager group is a subgroup of the Lead group, it has both the Manage Data Analyzer and Advanced Consumer role privileges.

Belonging to multiple groups has an inclusive effect. For example, if group 1 has access to something but group 2 is excluded from that object, a user belonging to both groups 1 and 2 will have access to the object.


Preventing Data Analyzer from Updating Group Information


If you use Windows Domain or LDAP authentication, you typically modify the users or groups in Data Analyzer. However, some organizations keep only user accounts in the Windows Domain or LDAP directory service, but set up groups in Data Analyzer to organize the Data Analyzer users. Data Analyzer provides a way for you to keep user accounts in the authentication server and still keep the groups in Data Analyzer.

Ordinarily, when Data Analyzer synchronizes the repository with the Windows Domain or LDAP directory service, it updates the users and groups in the repository and deletes users and groups that are not found in the Windows Domain or LDAP directory service. To prevent Data Analyzer from deleting or updating groups in the repository, you can set a property in the web.xml file so that Data Analyzer updates only user accounts, not groups. You can then create and manage groups in Data Analyzer for users in the Windows Domain or LDAP directory service.

The web.xml file is stored in the Data Analyzer EAR file. To access the files in the Data Analyzer EAR file, use the EAR Repackager utility provided with Data Analyzer.

Note: Be sure to back up the web.xml file before you modify it.

To prevent Data Analyzer from updating group information in the repository:

1. In the directory where you extracted the Data Analyzer EAR file, locate the web.xml file in the following directory: /custom/properties
2. Open the web.xml file with a text editor and locate the line containing the following property: enableGroupSynchronization
   The enableGroupSynchronization property determines whether Data Analyzer updates the groups in the repository.

3. To prevent Data Analyzer from updating group information in the Data Analyzer repository, change the value of the enableGroupSynchronization property to false:

   <init-param>
     <param-name>
       InfSchedulerStartup.com.informatica.ias.scheduler.enableGroupSynchronization
     </param-name>
     <param-value>false</param-value>
   </init-param>

   When the value of the enableGroupSynchronization property is false, Data Analyzer does not synchronize the groups in the repository with the groups in the Windows Domain or LDAP directory service.
4. Save the web.xml file and add it back to the Data Analyzer EAR file.
5. Restart Data Analyzer.

When the enableGroupSynchronization property in the web.xml file is set to false, Data Analyzer updates only the user accounts in Data Analyzer the next time it synchronizes with the Windows Domain or LDAP authentication server. You must create and manage groups, and assign users to groups, in Data Analyzer.

Managing Users
Each user must have a unique user name to access Data Analyzer. To perform Data Analyzer tasks, a user must have the appropriate privileges. You can assign privileges to a user with roles or groups. Data Analyzer creates a System Administrator user account when you create the repository. The default user name for the System Administrator user account is admin.

The system daemon, ias_scheduler/padaemon, runs the updates for all time-based schedules. System daemons must have a unique user name and password in order to perform Data Analyzer system functions and tasks. You can change the password for a system daemon, but you cannot change the system daemon user name via the GUI. Data Analyzer permanently assigns the daemon role to system daemons. You cannot assign new roles to system daemons or assign them to groups.

To change the password for a system daemon, complete the following steps:

1. Change the password in the Administration tab in Data Analyzer.
2. Change the password in the web.xml file in the Data Analyzer folder.
3. Restart Data Analyzer.

Access LDAP Directory Contacts



To access contacts in the LDAP directory service, you can add the LDAP server on the LDAP Settings page. After you set up the connection to the LDAP directory service, users can email reports and shared documents to LDAP directory contacts. When you add an LDAP server, you must provide a value for the BaseDN (distinguished name) property. In the BaseDN property, enter the Base DN entries for your LDAP directory. The Base distinguished name entries define the type of information that is stored in the LDAP directory. If you do not know the value for BaseDN, contact your LDAP system administrator.

Customizing User Access


You can customize Data Analyzer user access with the following security options:
- Access permissions. Restrict user and/or group access to folders, reports, dashboards, attributes, metrics, template dimensions, or schedules. Use access permissions to restrict access to a particular folder or object in the repository.
- Data restrictions. Restrict user and/or group access to information in fact and dimension tables and operational schemas. Use data restrictions to prevent certain users or groups from accessing specific values when they create reports.
- Password restrictions. Restrict users from changing their passwords. Use password restrictions when you do not want users to alter their passwords.

When you create an object in the repository, every user has default read and write permissions for that object. By customizing access permissions for an object, you determine which users and/or groups can read, write, delete, or change access permissions for that object. When you set data restrictions, you determine which users and groups can view particular attribute values. If a user with a data restriction runs a report, Data Analyzer does not display the restricted data to that user.

Types of Access Permissions


Access permissions determine the tasks that you can perform for a specific repository object. When you set access permissions, you determine which users and groups have access to the folders and repository objects. You can assign the following types of access permissions to repository objects:
- Read. Allows you to view a folder or object.
- Write. Allows you to edit an object. Also allows you to create and edit folders and objects within a folder.
- Delete. Allows you to delete a folder or an object from the repository.
- Change permission. Allows you to change the access permissions on a folder or object.

By default, Data Analyzer grants read and write access permissions to every user in the repository. You can use the General Permissions area to modify default access permissions for an object, or turn off default access permissions.


Data Restrictions
You can restrict access to data based on the values of related attributes. Data restrictions are set to keep sensitive data from appearing in reports. For example, you may want to restrict data related to the performance of a new store from outside vendors. You can set a data restriction that excludes the store ID from their reports. You can set data restrictions using one of the following methods:
- Set data restrictions by object. Restrict access to attribute values in a fact table, operational schema, real-time connector, or real-time message stream. You can apply the data restriction to users and groups in the repository. Use this method to apply the same data restrictions to more than one user or group.
- Set data restrictions for one user or group at a time. Edit a user account or group to restrict user or group access to specified data. You can set one or more data restrictions for each user or group. Use this method to set custom data restrictions for different users or groups.

Types of Data Restrictions


You can set two kinds of data restrictions:
- Inclusive. Use the IN option to allow users to access data related to the attributes you select. For example, to allow users to view only data from the year 2001, create an IN 2001 rule.
- Exclusive. Use the NOT IN option to restrict users from accessing data related to the attributes you select. For example, to allow users to view all data except from the year 2001, create a NOT IN 2001 rule.

Restricting Data Access by User or Group


You can edit a user or group profile to restrict the data the user or group can access in reports. When you edit a user profile, you can set data restrictions for any schema in the repository, including operational schemas and fact tables. You can set a data restriction to limit user or group access to data in a single schema based on the attributes you select. If the attributes apply to more than one schema in the repository, you can also restrict the user or group access from related data across all schemas in the repository.

For example, you may have a Sales fact table and a Salary fact table. Both tables use the Region attribute. You can set one data restriction that applies to both the Sales and Salary fact tables based on the region you select.

To set data restrictions for a user or group, you need the following role or privilege:

- System Administrator role
- Access Management privilege

When Data Analyzer runs scheduled reports that have provider-based security, it runs reports against the data restrictions for the report owner. However, if the reports have consumer-based security, the Data Analyzer Server creates a separate report for each unique security profile.

The following information applies to the required steps for changing the admin user on WebLogic only.

To change the Data Analyzer system administrator username on WebLogic 8.1 (DA 8.1)

- Repository authentication. You must use the Update System Accounts utility to change the system administrator account name in the repository.
- LDAP or Windows Domain authentication. Set up the new system administrator account in the Windows Domain or LDAP directory service. Then use the Update System Accounts utility to change the system administrator account name in the repository.

To change the Data Analyzer default users from admin, ias_scheduler/padaemon


1. Back up the repository.
2. Go to the WebLogic library directory: .\bea\wlserver6.1\lib
3. Open the file ias.jar and locate the file entry called InfChangeSystemUserNames.class.
4. Extract the file "InfChangeSystemUserNames.class" into a temporary directory (example: d:\temp).
5. This extracts the file as 'd:\temp\Repository Utils\Refresh\InfChangeSystemUserNames.class'.
6. Create a batch file (change_sys_user.bat) with the following commands in the directory D:\Temp\Repository Utils\Refresh\:

   REM To change the system user name and password
   REM *******************************************
   REM Change the BEA home here
   REM ************************
   set JAVA_HOME=E:\bea\wlserver6.1\jdk131_06
   set WL_HOME=E:\bea\wlserver6.1
   set CLASSPATH=%WL_HOME%\sql
   set CLASSPATH=%CLASSPATH%;%WL_HOME%\lib\jconn2.jar
   set CLASSPATH=%CLASSPATH%;%WL_HOME%\lib\classes12.zip
   set CLASSPATH=%CLASSPATH%;%WL_HOME%\lib\weblogic.jar
   set CLASSPATH=%CLASSPATH%;%WL_HOME%\lib\ias.jar
   set CLASSPATH=%CLASSPATH%;%WL_HOME%\lib\ias_securityadapter.jar
   set CLASSPATH=%CLASSPATH%;%WL_HOME%\infalicense
   REM Change the DB information here and also
   REM the -Dias_scheduler and -Dadmin values to values of your choice
   REM *************************************************************
   %JAVA_HOME%\bin\java -Ddriver=com.informatica.jdbc.sqlserver.SQLServerDriver -Durl=jdbc:informatica:sqlserver://host_name:port;SelectMethod=cursor;DatabaseName=database_name -Duser=userName -Dpassword=userPassword -Dias_scheduler=pa_scheduler -Dadmin=paadmin repositoryutil.refresh.InfChangeSystemUserNames
   REM END OF BATCH FILE


7. Make changes in the batch file as directed in the remarks [REM lines].
8. Save the file, open a command prompt window, and navigate to D:\Temp\Repository Utils\Refresh\
9. At the prompt, type change_sys_user.bat and press Enter. The users "ias_scheduler" and "admin" will be changed to "pa_scheduler" and "paadmin", respectively.
10. Modify web.xml and weblogic.xml (located at .\bea\wlserver6.1\config\informatica\applications\ias\WEB-INF) by replacing ias_scheduler with pa_scheduler.
11. Replace ias_scheduler with pa_scheduler in the xml file weblogic-ejb-jar.xml. This file is in the iasEjb.jar file located in the directory .\bea\wlserver6.1\config\informatica\applications\. To edit the file, make a copy of the iasEjb.jar:

    mkdir \tmp
    cd \tmp
    jar xvf \bea\wlserver6.1\config\informatica\applications\iasEjb.jar META-INF
    cd META-INF
    Update META-INF\weblogic-ejb-jar.xml, replacing ias_scheduler with pa_scheduler
    cd \
    jar uvf \bea\wlserver6.1\config\informatica\applications\iasEjb.jar -C \tmp .

    Note: There is a trailing period at the end of the last command above.
12. Restart the server.

Last updated: 04-Jun-08 15:51


Database Sizing

Challenge


Database sizing involves estimating the types and sizes of the components of a data architecture. This is important for determining the optimal configuration for the database servers in order to support the operational workloads. Individuals involved in a sizing exercise may be data architects, database administrators, and/or business analysts.

Description
The first step in database sizing is to review system requirements to define such things as:
- Expected data architecture elements (will there be staging areas? operational data stores? centralized data warehouse and/or master data? data marts?). Each additional database element requires more space. This is even more true in situations where data is being replicated across multiple systems, such as a data warehouse maintaining an operational data store as well. The same data in the ODS will be present in the warehouse, albeit in a different format.

- Expected source data volume. It is useful to analyze how each row in the source system translates into the target system. In most situations the row count in the target system can be calculated by following the data flows from the source to the target. For example, say a sales order table is being built by denormalizing a source table. The source table holds sales data for 12 months in a single row (one column for each month). Each row in the source translates to 12 rows in the target, so a source table with one million rows ends up as a 12-million-row table.

- Data granularity and periodicity. Granularity refers to the lowest level of information that is going to be stored in a fact table. Granularity affects the size of a database to a great extent, especially for aggregate tables. The level at which a table has been aggregated increases or decreases a table's row count. For example, a sales order fact table's size is likely to be greatly affected by whether the table is being aggregated at a monthly level or at a quarterly level. The granularity of fact tables is determined by the dimensions linked to that table. The number of dimensions connected to the fact table affects the granularity of the table and hence its size.

- Load frequency and method (full refresh? incremental updates?). Load frequency affects the space requirements for the staging areas. A load plan that updates a target less frequently is likely to load more data at one go. Therefore, more space is required by the staging areas. A full refresh requires more space for the same reason.
- Estimated growth rates over time and retained history.
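The source-to-target row analysis described above (following the data flows from source to target) reduces to a running product of fan-out factors. The sketch below is purely illustrative; the function and variable names are not part of any Informatica tooling:

```python
def target_rows(source_rows, fanouts):
    """Follow the data flow: multiply the source row count by each step's row fan-out."""
    rows = source_rows
    for fanout in fanouts:
        rows *= fanout
    return rows

# The denormalization example from the text: one source row holding 12 monthly
# columns becomes 12 target rows, so 1 million source rows yield 12 million.
denormalized = target_rows(1_000_000, [12])
```

Chaining several fan-out factors lets you follow a row through multiple transformations in the data flow.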

Determining Growth Projections


One way to estimate projections of data growth over time is to use scenario analysis. As an example, for scenario analysis of a sales tracking data mart you can use the number of sales transactions to be stored as the basis for the sizing estimate. In the first year, 10 million sales transactions are expected; this equates to 10 million fact-table records. Next, use the sales growth forecasts for the upcoming years for database growth calculations. That is, an annual sales growth rate of 10 percent translates into 11 million fact table records for the next year. At the end of five years, the fact table is likely to contain about 60 million records. You may want to calculate other estimates based on five-percent annual sales growth (case 1) and 20-percent annual sales growth (case 2). Multiple projections for best and worst case scenarios can be very helpful.
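The scenario arithmetic above is easy to script. The function below is an illustrative sketch (not an Informatica or database utility) that accumulates each year's fact-table load under a chosen growth rate:

```python
def projected_fact_rows(first_year_rows, annual_growth, years):
    """Cumulative fact-table row count: each year's load grows by the sales growth rate."""
    total = 0.0
    yearly = first_year_rows
    for _ in range(years):
        total += yearly
        yearly *= 1 + annual_growth
    return total

# Scenario from the text: 10 million transactions in year 1, 10% annual growth,
# which accumulates to roughly the "about 60 million records" cited above.
base = projected_fact_rows(10_000_000, 0.10, 5)
low = projected_fact_rows(10_000_000, 0.05, 5)   # case 1: 5% growth
high = projected_fact_rows(10_000_000, 0.20, 5)  # case 2: 20% growth
```

Running the three cases side by side gives the best-case and worst-case row counts to size against.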

Oracle Table Space Prediction Model


Oracle (10g and onwards) provides a mechanism to predict the growth of a database. This feature can be useful in predicting table space requirements. Oracle incorporates a table space prediction model in the database engine that provides projected statistics for space used by a table. The following Oracle 10g query returns projected space usage statistics:

SELECT *
FROM TABLE(DBMS_SPACE.object_growth_trend('schema','tablename','TABLE'))
ORDER BY timepoint;

The results of this query are shown below:

TIMEPOINT                      SPACE_USAGE SPACE_ALLOC QUALITY
------------------------------ ----------- ----------- ------------
11-APR-04 02.55.14.116000 PM          6372       65536 INTERPOLATED
12-APR-04 02.55.14.116000 PM          6372       65536 INTERPOLATED
13-APR-04 02.55.14.116000 PM          6372       65536 INTERPOLATED
13-MAY-04 02.55.14.116000 PM          6372       65536 PROJECTED
14-MAY-04 02.55.14.116000 PM          6372       65536 PROJECTED
15-MAY-04 02.55.14.116000 PM          6372       65536 PROJECTED
16-MAY-04 02.55.14.116000 PM          6372       65536 PROJECTED

The QUALITY column indicates the quality of the output as follows:


- GOOD - The data for the timepoint relates to data within the AWR repository with a timestamp within 10 percent of the interval.
- INTERPOLATED - The data for this timepoint did not meet the GOOD criteria but was based on data gathered before and after the timepoint.
- PROJECTED - The timepoint is in the future, so the data is estimated based on previous growth statistics.

Baseline Volumetric
Next, use the physical data models for the sources and the target architecture to develop a baseline sizing estimate. The administration guides for most DBMSs contain sizing guidelines for the various database structures such as tables, indexes, sort space, data files, log files, and database cache.

Develop a detailed sizing using a worksheet inventory of the tables and indexes from the physical data model, along with field data types and field sizes. Various database products use different storage methods for data types, so be sure to use the database manuals to determine the size of each data type. Add up the field sizes to determine row size. Then use the data volume projections to determine the number of rows, and multiply by the row size to get the table size. The default estimate for index size is to assume the same size as the table. Also estimate the temporary space for sort operations. For data warehouse applications where summarizations are common, plan on large temporary spaces. The temporary space can be as much as 1.5 times larger than the largest table in the database.

Another approach that is sometimes useful is to load the data architecture with representative data and determine the resulting database sizes. This test load can be a fraction of the actual data and is used only to gather basic sizing statistics. You then need to apply growth projections to these statistics. For example, after loading ten thousand sample records to the fact table, you determine the size to be 10MB. Based on the scenario analysis, you can expect this fact table to contain 60 million records after five years. So, the estimated size for the fact table is about 60GB [i.e., 10 MB * (60,000,000/10,000)]. Don't forget to add indexes and summary tables to the calculations.
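The worksheet arithmetic and the test-load extrapolation above reduce to two small helpers. These functions are illustrative, not part of any DBMS tooling; the index-equals-table default and the 1.5x temporary-space multiplier come from the guidelines above:

```python
def table_size_mb(avg_row_bytes, row_count):
    """Worksheet estimate: row size (sum of field sizes) times projected rows."""
    return avg_row_bytes * row_count / (1024 * 1024)

def extrapolate_mb(sample_size_mb, sample_rows, projected_rows):
    """Scale a measured test-load size up to the projected row count."""
    return sample_size_mb * (projected_rows / sample_rows)

# The text's example: a 10 MB sample of 10,000 rows, projected to 60 million rows.
fact_mb = extrapolate_mb(10, 10_000, 60_000_000)
index_mb = fact_mb        # default estimate: index space roughly equals table space
temp_mb = 1.5 * fact_mb   # sort/temporary space: up to 1.5x the largest table
```

Summing these per table (plus aggregates and summary tables) yields the baseline volumetric for the database.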

Guesstimating
When there is not enough information to calculate an estimate as described above, use educated guesses and rules of thumb to develop as reasonable an estimate as possible.
- If you don't have the source data model, use what you do know of the source data to estimate average field size and average number of fields in a row to determine table size. Based on your understanding of transaction volume over time, determine your growth metrics for each type of data and calculate out your source data volume (SDV) from table size and growth metrics.
- If your target data architecture is not completed so that you can determine table sizes, base your estimates on multiples of the SDV:
  - If it includes staging areas: add another SDV for any source subject area that you will stage, multiplied by the number of loads you'll retain in staging.
  - If you intend to consolidate data into an operational data store, add the SDV multiplied by the number of loads to be retained in the ODS for historical purposes (e.g., keeping one year's worth of monthly loads = 12 x SDV).
  - Data warehouse architectures are based on the periodicity and granularity of the warehouse; this may be another SDV + (.3n x SDV, where n = number of time periods loaded in the warehouse over time).
  - If your data architecture includes aggregates, add a percentage of the warehouse volumetrics based on how much of the warehouse data will be aggregated and to what level (e.g., if the rollup level represents 10 percent of the dimensions at the detail level, use 10 percent). Similarly, for data marts add a percentage of the data warehouse based on how much of the warehouse data is moved into the data mart.
  - Be sure to consider the growth projections over time and the history to be retained in all of your calculations.

And finally, remember that there is always much more data than you expect so you may want to add a reasonable fudge-factor to the calculations for a margin of safety.
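The rules of thumb above can be combined into a single back-of-the-envelope calculation. In the sketch below, the multipliers come directly from the bullets (one SDV per retained staged load, the number of loads retained in the ODS, SDV + .3n x SDV for the warehouse, and percentage shares for aggregates and marts), while the function name and the 1.25 default fudge factor are illustrative assumptions:

```python
def guesstimate_gb(sdv_gb, staged_loads=0, ods_loads=0, warehouse_periods=0,
                   aggregate_pct=0.0, mart_pct=0.0, fudge=1.25):
    """Rule-of-thumb sizing from the source data volume (SDV), in GB."""
    staging = sdv_gb * staged_loads                        # one SDV per retained staged load
    ods = sdv_gb * ods_loads                               # e.g., 12 x SDV for a year of monthly loads
    warehouse = sdv_gb + 0.3 * warehouse_periods * sdv_gb  # SDV + (.3n x SDV)
    aggregates = warehouse * aggregate_pct                 # e.g., 0.10 if rollups are 10% of detail
    marts = warehouse * mart_pct                           # share of warehouse data moved to marts
    return (staging + ods + warehouse + aggregates + marts) * fudge

# Example: 100 GB SDV, one staged load, a year of monthly ODS loads,
# 12 warehouse periods, 10% aggregates, 20% marts, no safety margin.
estimate = guesstimate_gb(100, staged_loads=1, ods_loads=12, warehouse_periods=12,
                          aggregate_pct=0.1, mart_pct=0.2, fudge=1.0)
```

Adjust the fudge factor to whatever safety margin your organization is comfortable with.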

Last updated: 19-Jul-07 14:14


Deployment Groups

Challenge


In selectively migrating objects from one repository folder to another, there is a need for a versatile and flexible mechanism that can overcome such limitations as confinement to a single source folder.

Description
Regulations such as Sarbanes-Oxley (SOX) and HIPAA require tracking, monitoring, and reporting of changes in information technology systems. Automation of change control processes using deployment groups and pmrep commands provides organizations with a means to comply with regulations for configuration management of software artifacts in a PowerCenter repository.

Deployment groups are containers that hold references to objects that need to be migrated. This includes objects such as mappings, mapplets, reusable transformations, sources, targets, workflows, sessions and tasks, as well as the object holders (i.e., the repository folders). Deployment groups are faster and more flexible than folder moves for incremental changes. In addition, they allow for migration rollbacks if necessary.

Migrating a deployment group involves moving objects in a single copy operation from across multiple folders in the source repository into multiple folders in the target repository. When copying a deployment group, individual objects to be copied can be selected as opposed to the entire contents of a folder.

There are two types of deployment groups - static and dynamic.
- Static deployment groups contain direct references to versions of objects that need to be moved. Users explicitly add the version of the object to be migrated to the deployment group. If the set of deployment objects is not expected to change between deployments, static deployment groups can be created.
- Dynamic deployment groups contain a query that is executed at the time of deployment. The results of the query (i.e., object versions in the repository) are then selected and copied to the deployment group. If the set of deployment objects is expected to change frequently between deployments, dynamic deployment groups should be used.


Dynamic deployment groups are generated from a query. While any available criteria can be used, it is advisable to have developers use labels to simplify the query. For more information, refer to the Strategies for Labels section of Using PowerCenter Labels.

When generating a query for deployment groups with mappings and mapplets that contain non-reusable objects, in addition to the specific selection criteria, a further query condition should be used. The query must include a condition for Is Reusable with a qualifier of both Reusable and Non-Reusable. Without this qualifier, the deployment may encounter errors if there are non-reusable objects held within the mapping or mapplet.

A deployment group exists in a specific repository. It can be used to move items to any other accessible repository/folder. A deployment group maintains a history of all migrations it has performed. It tracks which versions of objects were moved from which folders in which source repositories, and into which folders in which target repositories those versions were copied (i.e., it provides a complete audit trail of all migrations performed). Given that the deployment group knows what it moved and to where, an administrator can, if necessary, have the deployment group undo the most recent deployment, reverting the target repository to its pre-deployment state.

Using labels (as described in the Using PowerCenter Labels Best Practice) allows objects in the subsequent repository to be tracked back to a specific deployment.

It is important to note that the deployment group only migrates the objects it contains to the target repository/folder. It does not, itself, move to the target repository; it still resides in the source repository.

Deploying via the GUI


Migrations can be performed via the GUI or the command line (pmrep). In order to migrate objects via the GUI, simply drag a deployment group from the repository it resides in onto the target repository where the referenced objects are to be moved. The Deployment Wizard appears and steps the user through the deployment process. Once the wizard is complete, the migration occurs, and the deployment history is created.

Deploying via the Command Line


Alternatively, the PowerCenter pmrep command can be used to automate both Folder Level deployments (e.g., in a non-versioned repository) and deployments using Deployment Groups. The commands DeployFolder and DeployDeploymentGroup in pmrep are used respectively for these purposes. Whereas deployment via the GUI requires stepping through a wizard and answering a series of questions to deploy, the command-line deployment requires an XML control file that contains the same


information that the wizard requests. This file must be present before the deployment is executed.

The following steps can be used to create a script to wrap pmrep commands and automate PowerCenter deployments:

1. Use pmrep ListObjects to return the object metadata to be parsed in another pmrep command.
2. Use pmrep CreateDeploymentGroup to create a dynamic or static deployment group.
3. Use pmrep ExecuteQuery to output the results to a persistent input file. This input file can also be used for the AddToDeploymentGroup command.
4. Use DeployDeploymentGroup to copy a deployment group to a different repository. A control file with all the specifications is required for this command.

Additionally, a web interface can be built for entering/approving/rejecting code migration requests. This can provide additional traceability and reporting capabilities to the automation of PowerCenter code migrations.
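The steps above can be strung together in a simple script. The fragment below is only an illustrative sketch: the repository, domain, user, query, group, and file names are placeholders, and option letters vary between PowerCenter versions, so verify each command against pmrep help or the Command Reference before use.

```shell
# All names (DEV_REPO, QA_REPO, NIGHTLY_DG, etc.) are hypothetical placeholders.
pmrep connect -r DEV_REPO -d Domain_Dev -n deploy_user -x "$PMPASS"

# Create a deployment group to hold the objects (static in this sketch).
pmrep createdeploymentgroup -p NIGHTLY_DG -t static

# Run a saved object query, persist the result set to a file,
# then add that result set to the deployment group.
pmrep executequery -q labeled_for_deploy -t shared -u query_results.txt
pmrep addtodeploymentgroup -p NIGHTLY_DG -i query_results.txt

# Deploy to the target repository; the XML control file must already exist.
pmrep deploydeploymentgroup -p NIGHTLY_DG -c deploy_control.xml -r QA_REPO
```

Wrapping these calls in a scheduler job, with the control file generated from a template, gives a repeatable and auditable migration path.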

Considerations for Deployment and Deployment Groups

Simultaneous Multi-Phase Projects


If multiple phases of a project are being developed simultaneously in separate folders, it is possible to consolidate them by mapping folders appropriately through the deployment group migration wizard. When migrating with deployment groups in this way, the override buttons in the migration wizard are used to select specific folder mappings.

Rolling Back a Deployment


Deployment groups help to ensure that there is a back-out methodology and that the latest version of a deployment can be rolled back. To do this, in the target repository (where the objects were migrated to), go to: Versioning >> Deployment >> History >> View History >> Rollback.

The rollback purges all objects (of the latest version) that were in the deployment group. Initiate a rollback on a deployment in order to roll back only the latest versions of the objects. The rollback ensures that the check-in time for the repository objects is the same as the deploy time. Also, the pmrep command RollBackDeployment can be used for automating rollbacks. Remember that you cannot roll back part of a deployment; you must roll back all the objects in a deployment group.

Managing Repository Size


As objects are checked in and deployed to target repositories, the number of object versions in those repositories increases, as does the size of the repositories. In order to manage repository size, use a combination of Check-in Date and Latest Status (both are query parameters) to purge the desired versions from the repository and retain only the very latest version. Also, all deleted versions of objects should be purged to reduce the size of the repository.

If it is necessary to keep more than the latest version, labels can be included in the query. These labels are ones that have been applied to the repository for the specific purpose of identifying objects for purging.

Off-Shore, On-Shore Migration


In a situation where development is performed off-shore and migration occurs on-shore, other aspects of the computing environment may make it desirable to generate a dynamic deployment group. Instead of migrating the group itself to the next repository, a query can be used to select the objects for migration and save them to a single XML file, which can then be transmitted to the on-shore environment through alternative methods. If the on-shore repository is versioned, importing the file activates the import wizard as if a deployment group were being received.

Code Migration from Versioned Repository to a Non-Versioned Repository


In some instances, it may be desirable to migrate objects from a versioned repository to a non-versioned repository. Note that migrating in this manner changes the wizards used, and that the export from the versioned repository must take place using XML export.

Last updated: 27-May-08 13:20


Migration Procedures - PowerCenter

Challenge


Develop a migration strategy that ensures clean migration between development, test, quality assurance (QA), and production environments, thereby protecting the integrity of each of these environments as the system evolves.

Description
Ensuring that an application has a smooth migration process between development, QA, and production environments is essential for the deployment of an application. Deciding which migration strategy works best for a project depends on two primary factors.
- How is the PowerCenter repository environment designed? Are there individual repositories for development, QA, and production, or are there just one or two environments that share one or all of these phases?
- How has the folder architecture been defined?

Each of these factors plays a role in determining the migration procedure that is most beneficial to the project. PowerCenter offers flexible migration options that can be adapted to fit the need of each application. PowerCenter migration options include repository migration, folder migration, object migration, and XML import/export. In versioned PowerCenter repositories, users can also use static or dynamic deployment groups for migration, which provides the capability to migrate any combination of objects within the repository with a single command. This Best Practice is intended to help the development team decide which technique is most appropriate for the project. The following sections discuss various options that are available, based on the environment and architecture selected. Each section describes the major advantages of its use, as well as its disadvantages.

Repository Environments
The following section outlines the migration procedures for standalone and distributed repository environments. The distributed environment section touches on several migration architectures, outlining the pros and cons of each. Also, please note that any methods described in the Standalone section may also be used in a Distributed environment.

Standalone Repository Environment


In a standalone environment, all work is performed in a single PowerCenter repository that serves as the metadata store. Separate folders are used to represent the development, QA, and production workspaces and segregate work. This type of architecture within a single repository ensures seamless migration from development to QA, and from QA to production.

The following example shows a typical architecture. In this example, the company has chosen to create separate development folders for each of the individual developers for development and unit test purposes. A single shared or common development folder, SHARED_MARKETING_DEV, holds all of the common objects, such as sources, targets, and reusable mapplets. In addition, two test folders are created for QA purposes. The first contains all of the unit-tested mappings from the development folder. The second is a common or shared folder that contains all of the tested shared objects. Eventually, as the following paragraphs explain, two production folders will also be built.


Proposed Migration Process - Single Repository


DEV to TEST - Object-Level Migration

Now that we've described the repository architecture for this organization, let's discuss how it will migrate mappings to test, and then eventually to production. After all mappings have completed their unit testing, the process for migration to test can begin. The first step in this process is to copy all of the shared or common objects from the SHARED_MARKETING_DEV folder to the SHARED_MARKETING_TEST folder. This can be done using one of two methods:

- The first, and most common, method is object migration via an object copy. In this case, a user opens the SHARED_MARKETING_TEST folder and drags the object from SHARED_MARKETING_DEV into the appropriate workspace (i.e., Source Analyzer, Warehouse Designer, etc.). This is similar to dragging a file from one folder to another using Windows Explorer.
- The second approach is object migration via object XML import/export. A user can export each of the objects in the SHARED_MARKETING_DEV folder to XML, and then re-import each object into SHARED_MARKETING_TEST via XML import. The XML files can be uploaded to a third-party versioning tool if the organization has standardized on such a tool; otherwise, versioning can be enabled in PowerCenter. Migration with versioned PowerCenter repositories is covered later in this document.
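Either method can also be scripted. The pmrep commands below sketch the XML export/import route; the repository, object, and file names are hypothetical, and the flag spellings should be confirmed against the pmrep command reference for your PowerCenter version.

```shell
# Connect to the development repository (host/port as in this environment).
pmrep connect -r INFADEV -n Administrator -x AdminPwd -h infarepserver -o 7001

# Export a shared source definition from the development folder to XML.
pmrep objectexport -n CUSTOMER_SRC -o source -f SHARED_MARKETING_DEV -u customer_src.xml

# Connect to the test repository and import the exported XML.
# The control file (not shown) tells pmrep how to resolve conflicts
# (replace, reuse, or rename) without interactive prompts.
pmrep connect -r INFATEST -n Administrator -x AdminPwd -h infarepserver -o 7001
pmrep objectimport -i customer_src.xml -c import_control.xml
```

The same exported XML files can also be checked into an external version control system, as noted above.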

After you've copied all common or shared objects, the next step is to copy the individual mappings from each development folder into the MARKETING_TEST folder. Again, you can use either of the two object-level migration methods described above to copy the mappings to the folder, although the XML import/export method is the most intuitive method for resolving shared object conflicts. However, the migration method is slightly different here because you must ensure that the shortcuts in the mapping are associated with the SHARED_MARKETING_TEST folder. Designer prompts the user to choose the correct shortcut folder created in the previous example, which points to SHARED_MARKETING_TEST (see image below). You can then continue the migration process until all mappings have been successfully migrated. In PowerCenter 7 and later versions, you can export multiple objects into a single XML file, and then import them at the same time.


The final step in the process is to migrate the workflows that use those mappings. Again, the object-level migration can be completed either through drag-and-drop or by using XML import/export. In either case, this process is very similar to the steps described above for migrating mappings, but differs in that the Workflow Manager provides a Workflow Copy Wizard to guide you through the process. The following steps outline the full process for successfully copying a workflow and all of its associated tasks.

1. The Wizard prompts for the name of the new workflow. If a workflow with the same name exists in the destination folder, the Wizard prompts you to rename it or replace it. If no such workflow exists, a default name is used. Click Next to continue the copy process.
2. The Wizard then checks whether each task already exists (as shown below). If the task is present, you can rename or replace the current one. If it does not exist, the default name is used (see below). Click Next.

3. Next, the Wizard prompts you to select the mapping associated with each session task in the workflow. Select the mapping and continue by clicking Next.


4. If connections exist in the target repository, the Wizard prompts you to select the connection to use for the source and target. If no connections exist, the default settings are used. When this step is completed, click "Finish" and save the work.

Initial Migration - New Folders Created


The move to production is very different for the initial move than for subsequent changes to mappings and workflows. Since the repository only contains folders for development and test, we need to create two new folders to house the production-ready objects. Create these folders after testing of the objects in SHARED_MARKETING_TEST and MARKETING_TEST has been approved. The following steps outline the creation of the production folders and, at the same time, address the initial test-to-production migration.

1. Open the PowerCenter Repository Manager client tool and log into the repository.
2. To make a shared folder for the production environment, highlight the SHARED_MARKETING_TEST folder, drag it, and drop it on the repository name.
3. The Copy Folder Wizard appears to guide you through the copying process.


4. The first Wizard screen asks if you want to use the typical folder copy options or the advanced options. In this example, we'll use the advanced options.

5. The second Wizard screen prompts you to enter a folder name. By default, the folder name that appears on this screen is the folder name followed by the date. In this case, enter the name as SHARED_MARKETING_PROD.


6. The third Wizard screen prompts you to select a folder to override. Because this is the first time you are transporting the folder, you won't need to select anything.

7. The final screen begins the actual copy process. Click "Finish" when the process is complete.


Repeat this process to create the MARKETING_PROD folder. Use the MARKETING_TEST folder as the original to copy, and associate the shared objects with the SHARED_MARKETING_PROD folder that you just created. At the end of the migration, you should have two additional folders in the repository environment for production: SHARED_MARKETING_PROD and MARKETING_PROD (as shown below). These folders contain the initially migrated objects. Before you can actually run the workflow in these production folders, you need to modify the session source and target connections to point to the production environment.

When you copy or replace a PowerCenter repository folder, the Copy Wizard copies the permissions for the folder owner to the target folder. The wizard does not copy permissions for users, groups, or all others in the repository to the target folder. Previously, the Copy Wizard copied the permissions for the folder owner, the owner's group, and all users in the repository to the target folder.

Incremental Migration - Object Copy Example


Now that the initial production migration is complete, let's take a look at how future changes will be migrated into the folder.


Any time an object is modified, it must be re-tested and migrated into production for the actual change to occur. These types of changes in production take place on a case-by-case or periodically-scheduled basis. The following steps outline the process of moving these objects individually.

1. Log into PowerCenter Designer. Open the destination folder and expand the source folder. Click on the object to copy and drag-and-drop it into the appropriate workspace window.
2. Because this is a modification to an object that already exists in the destination folder, Designer prompts you to choose whether to Rename or Replace the object (as shown below). Choose the option to Replace the object.

3. In PowerCenter 7 and later versions, you can choose to compare conflicts whenever migrating any object in Designer or Workflow Manager. By comparing the objects, you can ensure that the changes that you are making are what you intend. See below for an example of the mapping compare window.


4. After the object has been successfully copied, save the folder so the changes can take place.
5. The newly copied mapping is now tied to any sessions that the replaced mapping was tied to.
6. Log into Workflow Manager and make the appropriate changes to the session or workflow so it can update itself with the changes.

Standalone Repository Example


In this example, we look at moving development work to QA and then from QA to production, using multiple development folders for each developer, with the test and production folders divided by the data mart they represent. For this example, we focus solely on the MARKETING_DEV data mart, first explaining how to move objects and mappings from each individual folder to the test folder, and then how to move tasks, worklets, and workflows to the new area.

Follow these steps to copy a mapping from Development to QA:

1. If using shortcuts, first follow these sub-steps; if not using shortcuts, skip to step 2:
   - Copy the tested objects from the SHARED_MARKETING_DEV folder to the SHARED_MARKETING_TEST folder.
   - Drag all of the newly copied objects from the SHARED_MARKETING_TEST folder to MARKETING_TEST.
   - Save your changes.

2. Copy the mapping from Development into Test.
   - In the PowerCenter Designer, open the MARKETING_TEST folder, and drag and drop the mapping from each development folder into the MARKETING_TEST folder.
   - When copying each mapping in PowerCenter, Designer prompts you to either Replace, Rename, or Reuse the object, or Skip, for each reusable object, such as source and target definitions. Choose to Reuse the object for all shared objects in the mappings copied into the MARKETING_TEST folder.
   - Save your changes.

3. If a reusable session task is being used, follow these steps; otherwise, skip to step 4.
   - In the PowerCenter Workflow Manager, open the MARKETING_TEST folder and drag and drop each reusable session from the developers' folders into the MARKETING_TEST folder. A Copy Session Wizard guides you through the copying process.
   - Open each newly copied session and click on the Source tab. Change the source to point to the source database for the Test environment.
   - Click the Target tab. Change each connection to point to the target database for the Test environment. Be sure to double-check the workspace within the Target tab to ensure that the load options are correct.
   - Save your changes.

4. While the MARKETING_TEST folder is still open, copy each workflow from Development to Test.
   - Drag each workflow from the development folders into the MARKETING_TEST folder. The Copy Workflow Wizard appears. Follow the same steps listed above to copy the workflow to the new folder.
   - As mentioned earlier, in PowerCenter 7 and later versions, the Copy Wizard allows you to compare conflicts from within Workflow Manager to ensure that the correct migrations are being made.
   - Save your changes.

5. Implement the appropriate security.
   - In Development, the owner of the folders should be a user in the development group.
   - In Test, change the owner of the test folder to a user in the test group.
   - In Production, change the owner of the folders to a user in the production group.
   - Revoke all rights to Public other than Read for the production folders.

Rules to Configure Folder and Global Object Permissions

Rules in 8.5:
- The folder or global object owner, or a user assigned the Administrator role for the Repository Service, can grant folder and global object permissions.
- Permissions can be granted to users, groups, and all others in the repository.
- The folder or global object owner and a user assigned the Administrator role for the Repository Service have all permissions, which you cannot change.

Rules in Previous Versions:
- Users with the appropriate repository privileges could grant folder and global object permissions.
- Permissions could be granted to the owner, the owner's group, and all others in the repository.
- You could change the permissions for the folder or global object owner.

Disadvantages of a Single Repository Environment


The biggest disadvantage, or challenge, with a single repository environment is the migration of repository objects with respect to database connections. When migrating objects from Dev to Test to Prod, you can't reuse the same database connections, because they point to the development or test environment. A single repository structure can also create confusion, as the same users and groups exist in all environments, and the number of folders can increase significantly.

Distributed Repository Environment



A distributed repository environment maintains separate, independent repositories, hardware, and software for development, test, and production environments. Separating repository environments is preferable for handling development-to-production migrations. Because the environments are segregated from one another, work performed in development cannot impact QA or production.

With a fully distributed approach, separate repositories function much like the separate folders in a standalone environment. Each repository has a similar name, like the folders in the standalone environment. For instance, in our Marketing example we would have three repositories: INFADEV, INFATEST, and INFAPROD. In the following example, we discuss a distributed repository architecture.

There are four techniques for migrating from development to production in a distributed repository architecture, each involving some advantages and disadvantages:

- Repository Copy
- Folder Copy
- Object Copy
- Deployment Groups

Repository Copy
So far, this document has covered object-level migrations and folder migrations through drag-and-drop object copying and object XML import/export. This section discusses migrations in a distributed repository environment through repository copies. The main advantages of this approach are:

- The ability to copy all objects (i.e., mappings, workflows, mapplets, reusable transformations, etc.) at once from one environment to another.
- The ability to automate this process using pmrep commands, thereby eliminating many of the manual processes that users typically perform.
- The ability to move everything without breaking or corrupting any of the objects.

This approach also involves a few disadvantages.



- The first is that everything is moved at once (which is also an advantage). The problem is that everything is moved, ready or not. For example, we may have 50 mappings in QA, but only 40 of them are production-ready. The 10 untested mappings are moved into production along with the 40 production-ready mappings, which leads to the second disadvantage.
- Significant maintenance is required to remove any unwanted or excess objects. There is also a need to adjust server variables, sequences, parameters/variables, database connections, etc. Everything must be set up correctly before the actual production runs can take place.
- Lastly, the repository copy process requires that the existing Production repository be deleted before the Test repository can be copied. This results in a loss of production environment operational metadata such as load statuses and session run times. High-performance organizations leverage the value of operational metadata to track trends over time related to load success/failure and duration. This metadata can be a competitive advantage for organizations that use this information to plan for future growth.

Now that we've discussed the advantages and disadvantages, we'll look at three ways to accomplish the Repository Copy method:

- Copying the Repository
- Repository Backup and Restore
- PMREP

Copying the Repository


Copying the Test repository to Production through the GUI client tools is the easiest of all the migration methods. First, ensure that all users are logged out of the destination repository and then connect to the PowerCenter Repository Administration Console (as shown below).

If the Production repository already exists, you must delete it before you can copy the Test repository. Before you can delete the repository, you must run the repository service in exclusive mode.

1. Click on the INFA_PROD repository on the left pane to select it, and change the running mode to exclusive by clicking the Edit button on the right pane under the Properties tab.

2. Delete the Production repository by selecting it and choosing Delete from the context menu.


3. Click on the Action drop-down list and choose Copy contents from


4. In the new window, choose the domain name and the repository service INFA_TEST from the drop-down menus. Enter the username and password of the Test repository.

5. Click OK to begin the copy process.
6. When you've successfully copied the repository to the new location, exit from the PowerCenter Administration Console.
7. In the Repository Manager, double-click on the newly copied repository and log in with a valid username and password.
8. Verify connectivity, then highlight each folder individually and rename them. For example, rename the MARKETING_TEST folder to MARKETING_PROD, and SHARED_MARKETING_TEST to SHARED_MARKETING_PROD.
9. Be sure to remove all objects that are not pertinent to the Production environment from the folders before beginning the actual testing process.
10. When this cleanup is finished, log into the repository through the Workflow Manager. Modify the server information and all connections so they point to the new Production locations for all existing tasks and workflows.

Repository Backup and Restore


Backup and Restore Repository is another simple method of copying an entire repository. This process backs up the repository to a binary file that can be restored to any new location. This method is preferable to the repository copy process because, if any type of error occurs, the file is backed up to the binary file on the repository server.

From 8.5 onwards, security information is maintained at the domain level. Before you back up a repository and restore it in a different domain, verify that users and groups with privileges for the source Repository Service exist in the target domain. The Service Manager periodically synchronizes the list of users and groups in the repository with the users and groups in the domain configuration database. During synchronization, users and groups that do not exist in the target domain are deleted from the repository. You can use infacmd to export users and groups from the source domain and import them into the target domain: use infacmd ExportUsersAndGroups to export the users and groups to a file, and infacmd ImportUsersAndGroups to import them from the file into a different PowerCenter domain.

The following steps outline the process of backing up and restoring the repository for migration.

1. Launch the PowerCenter Administration Console, and highlight the INFA_TEST repository service. Select Action -> Backup Contents from the drop-down menu.


2. A screen appears and prompts you to supply a name for the backup file, as well as the Administrator username and password. The file is saved to the Backup directory within the repository server's home directory.

3. After you've selected the location and file name, click OK to begin the backup process.
4. The backup process creates a .rep file containing all repository information. Stay logged into the Manage Repositories screen. When the backup is complete, select the repository connection to which the backup will be restored (i.e., the Production repository).

5. The system will prompt you to supply a username, password, and the name of the file to be restored. Enter the appropriate information and click OK. When the restoration process is complete, repeat the steps listed in the copy repository option to delete all of the unused objects and rename the folders.
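The cross-domain user and group synchronization described earlier (infacmd ExportUsersAndGroups / ImportUsersAndGroups) can be scripted as well. Domain names, credentials, and the file name below are hypothetical, and the option names should be verified with infacmd help for your version.

```shell
# Export users and groups from the source domain to a file.
infacmd ExportUsersAndGroups -dn TestDomain -un Administrator -pd AdminPwd -f users_groups.xml

# Import them into the target domain before restoring the repository backup,
# so the post-restore synchronization does not delete their accounts.
infacmd ImportUsersAndGroups -dn ProdDomain -un Administrator -pd AdminPwd -f users_groups.xml
```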

PMREP
Using the PMREP commands is essentially the same as the Backup and Restore Repository method except that it is run from the command line rather than through the GUI client tools. pmrep is installed in the PowerCenter Client and PowerCenter Services bin directories. PMREP utilities can be used from the Informatica Server or from any client machine connected to the server. Refer to the Repository Manager Guide for a list of PMREP commands. PMREP backup backs up the repository to the file specified with the -o option. You must provide the backup file name. Use this command when the repository is running. You must be connected to a repository to use this command. The BackUp command uses the following syntax: backup -o <output_file_name> [-d <description>] [-f (overwrite existing output file)] [-b (skip workflow and session logs)] [-j (skip deploy group history)] [-q (skip MX data)] [-v (skip task statistics)] The following is a sample of the command syntax used within a Windows batch file to connect to and backup a repository. Using this code example as a model, you can write scripts to be run on a daily basis to perform functions

such as connect, backup, restore, etc.:

backupproduction.bat

REM This batch file uses pmrep to connect to and back up the repository Production on the server Central
@echo off
echo Connecting to Production repository...
<Informatica Installation Directory>\Server\bin\pmrep connect -r INFAPROD -n Administrator -x Adminpwd -h infarepserver -o 7001
echo Backing up Production repository...
<Informatica Installation Directory>\Server\bin\pmrep backup -o c:\backup\Production_backup.rep

Alternatively, the following steps can be used:

1. Use infacmd commands to run the repository service in exclusive mode.
2. Use the pmrep backup command to back up the source repository.
3. Use the pmrep delete command to delete the contents of the target repository (if content already exists in the target repository).
4. Use the pmrep restore command to restore the backup file into the target repository.
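Sketched as a script, those four alternative steps might look like the following. Repository names, the backup path, and several option spellings are illustrative; treat this as an outline and verify each command with pmrep help.

```shell
# Step 1 (running the target repository service in exclusive mode) is done in
# the Administration Console or via infacmd and is not shown here.

# Step 2: back up the source (Test) repository.
pmrep connect -r INFATEST -n Administrator -x AdminPwd -h infarepserver -o 7001
pmrep backup -o /backup/INFATEST_full.rep -f

# Step 3: delete the contents of the target (Production) repository,
# if content already exists there.
pmrep connect -r INFAPROD -n Administrator -x AdminPwd -h infarepserver -o 7001
pmrep delete -f

# Step 4: restore the backup file into the target repository.
pmrep restore -i /backup/INFATEST_full.rep
```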

Post-Repository Migration Cleanup


After you have used one of the repository migration procedures to migrate into Production, follow these steps to convert the repository to Production:

1. Disable workflows that are not ready for Production, or simply delete the mappings, tasks, and workflows.
   - Disable the workflows not being used in the Workflow Manager by opening the workflow properties and checking the Disabled checkbox under the General tab.
   - Delete the tasks not being used in the Workflow Manager, and the mappings in the Designer.

2. Modify the database connection strings to point to the production sources and targets.
   - In the Workflow Manager, select Relational connections from the Connections menu.
   - Edit each relational connection by changing the connect string to point to the production sources and targets.
   - If you are using Lookup transformations in the mappings and the connect string is anything other than $SOURCE or $TARGET, modify the connect strings appropriately.

3. Modify the pre- and post-session commands and SQL as necessary.
   - In the Workflow Manager, open the session task properties and, from the Components tab, make the required changes to the pre- and post-session scripts.

4. Implement appropriate security, such as:
   - In Development, ensure that the owner of the folders is a user in the development group.
   - In Test, change the owner of the test folders to a user in the test group.
   - In Production, change the owner of the folders to a user in the production group.
   - Revoke all rights to Public other than Read for the Production folders.

Folder Copy
Although deployment groups are becoming a very popular migration method, the folder copy method has historically been the most popular way to migrate in a distributed environment. Copying an entire folder allows you to quickly promote all of the objects located within that folder. All source and target objects, reusable transformations, mapplets, mappings, tasks, worklets, and workflows are promoted at once. Because of this, however, everything in the folder must be ready to migrate forward. If some mappings or workflows are not valid, then developers (or the Repository Administrator) must manually delete them from the new folder after the folder is copied.

The three advantages of using the folder copy method are:

- The Repository Manager's Folder Copy Wizard makes it almost seamless to copy an entire folder and all the objects located within it.
- If the project uses a common or shared folder and this folder is copied first, then all shortcut relationships are automatically converted to point to this newly copied common or shared folder.
- All connections, sequences, mapping variables, and workflow variables are copied automatically.

The primary disadvantage of the folder copy method is that the repository is locked while the folder copy is being performed. Therefore, it is necessary to schedule this migration task during a time when the repository is least utilized. Remember that a locked repository means that no jobs can be launched during this process. This can be a serious consideration in real-time or near real-time environments.

The following example steps through the process of copying folders from each of the different environments. The first example uses three separate repositories for development, test, and production.

1. If using shortcuts, follow these sub-steps; otherwise skip to step 2:
   - Open the Repository Manager client tool.
   - Connect to both the Development and Test repositories.
   - Highlight the folder to copy and drag it to the Test repository. The Copy Folder Wizard appears to step you through the copy process.
   - When the folder copy process is complete, open the newly copied folder in both the Repository Manager and Designer to ensure that the objects were copied properly.

2. Copy the Development folder to Test. If you skipped step 1, follow these sub-steps:
   - Open the Repository Manager client tool.
   - Connect to both the Development and Test repositories.
   - Highlight the folder to copy and drag it to the Test repository. The Copy Folder Wizard will appear.


3. Follow these steps to ensure that all shortcuts are reconnected.
   - Use the advanced options when copying the folder across.
   - Select Next to use the default name of the folder.

4. If the folder already exists in the destination repository, choose to replace the folder.

The following screen appears to prompt you to select the folder where the new shortcuts are located.


In a situation where the folder names do not match, a folder compare will take place. The Copy Folder Wizard then completes the folder copy process. Rename the folder as appropriate and implement the security.

5. When testing is complete, repeat the steps above to migrate to the Production repository. When the folder copy process is complete, log onto the Workflow Manager and change the connections to point to the appropriate target location. Ensure that all tasks updated correctly and that folder and repository security is modified for test and production.
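Once the interactive wizard flow is understood, folder copies between repositories can also be automated with pmrep and a deployment control file. The names and options below are illustrative, and the control-file format is described in the pmrep documentation; confirm both before relying on this sketch.

```shell
# Copy the MARKETING_TEST folder from the Test repository to Production.
# deploy_control.xml (not shown) specifies the target folder name and how
# to resolve conflicts with any existing folder of the same name.
pmrep connect -r INFATEST -n Administrator -x AdminPwd -h infarepserver -o 7001
pmrep deployfolder -f MARKETING_TEST -c deploy_control.xml -r INFAPROD
```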

Object Copy
Copying mappings into the next stage in a networked environment involves many of the same advantages and disadvantages as in the standalone environment, but the process of handling shortcuts is simplified in the networked environment. For additional information, see the earlier description of Object Copy for the standalone environment. One advantage of Object Copy in a distributed environment is that it provides more granular control over objects.

Two distinct disadvantages of Object Copy in a distributed environment are:

- Much more work is required to deploy an entire group of objects.
- Shortcuts must exist prior to importing/copying mappings.

Below are the steps to complete an object copy in a distributed repository environment:

1. If using shortcuts, follow these sub-steps; otherwise skip to step 2:
   - In each of the distributed repositories, create a common folder with the exact same name and case.
   - Copy the shortcuts into the common folder in Production, making sure each shortcut has the exact same name.

2. Copy the mapping from the Test environment into Production.
   - In the Designer, connect to both the Test and Production repositories and open the appropriate folders in each.
   - Drag-and-drop the mapping from Test into Production. During the mapping copy process, PowerCenter 7 and later versions allow a comparison of this mapping to an existing copy of the mapping already in Production. Note that the ability to compare objects is not limited to mappings, but is available for all repository objects, including workflows, sessions, and tasks.

3. Create or copy a workflow with the corresponding session task in the Workflow Manager to run the mapping (first ensure that the mapping exists in the current repository).
   - If copying the workflow, follow the Copy Wizard.
   - If creating the workflow, add a session task that points to the mapping and enter all the appropriate information.

4. Implement appropriate security.
   - In Development, ensure the owner of the folders is a user in the development group.
   - In Test, change the owner of the test folders to a user in the test group.
   - In Production, change the owner of the folders to a user in the production group.
   - Revoke all rights to Public other than Read for the Production folders.

Deployment Groups
For versioned repositories, the use of deployment groups for migrations between distributed environments allows the most flexibility and convenience. With deployment groups, you can migrate individual objects as you would in an object copy migration, while retaining the convenience of a repository- or folder-level migration, since all objects are deployed at once. The objects included in a deployment group have no restrictions and can come from one or multiple folders. For added convenience, you can set up a dynamic deployment group, which defines its objects through a repository query rather than manual additions. Lastly, because deployment groups are available on versioned repositories, a deployment can be rolled back, reverting to the previous versions of the objects when necessary.

Advantages of Using Deployment Groups


- Backup and restore of the Repository needs to be performed only once.
- Copying a Folder replaces the previous copy.
- Copying a Mapping allows for different names to be used for the same object.
- Uses for Deployment Groups:
  - Deployment Groups are containers that hold references to objects that need to be migrated.
  - Allows for version-based object migration.
  - Faster and more flexible than folder moves for incremental changes.
  - Allows for migration rollbacks.
  - Allows specifying individual objects to copy, rather than the entire contents of a folder.

Types of Deployment Groups


- Static
  - Contain direct references to versions of objects that need to be moved.
BEST PRACTICES 143 of 954

INFORMATICA CONFIDENTIAL

  - Users explicitly add the version of the object to be migrated to the deployment group.
- Dynamic
  - Contain a query that is executed at the time of deployment.
  - The results of the query (i.e., object versions in the repository) are then selected and copied to the target repository.

Prerequisites
Create the required folders in the target repository.

Creating Labels
A label is a versioning object that you can associate with any versioned object or group of versioned objects in a repository.
- Advantages
  - Tracks versioned objects during development.
  - Improves query results.
  - Associates groups of objects for deployment.
  - Associates groups of objects for import and export.

- Create label
  - Create labels through the Repository Manager.
  - After creating the labels, go to edit mode and lock them. The "Lock" option is used to prevent other users from editing or applying the label. This option can be enabled only when the label is edited.
  - Some standard label examples are:
    - Development
    - Deploy_Test
    - Test
    - Deploy_Production
    - Production

- Apply label
  - Create a query to identify the objects that need to be labeled.
  - Run the query and apply the labels.

Note: By default, the latest version of the object gets labeled.

Queries
A query is an object used to search for versioned objects in the repository that meet specific conditions.
- Advantages
  - Tracks objects during development



  - Associates a query with a deployment group
  - Finds deleted objects you want to recover
  - Finds groups of invalidated objects you want to validate

- Create a query
  - The Query Browser allows you to create, edit, run, or delete object queries.
- Execute a query
  - Execute through the Query Browser.
  - Execute through the pmrep command line:
    ExecuteQuery -q query_name -t query_type -u persistent_output_file_name -a append -c column_separator -r end-of-record_separator -l end-of-listing_indicator -b verbose
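As a sketch of the pmrep ExecuteQuery syntax shown above, a script might build and log the command before running it. The query name, output file, and connection details below are hypothetical placeholders, not values from this document:

```shell
#!/bin/sh
# Build the pmrep ExecuteQuery call from the syntax shown above.
# QUERY and OUTFILE are hypothetical placeholders.
QUERY="q_ready_for_migration"
OUTFILE="/tmp/query_results.txt"

CMD="pmrep ExecuteQuery -q $QUERY -t shared -u $OUTFILE -a append -b verbose"

# Echoed as a dry run here; in a live environment, connect with
# "pmrep connect ..." first and then execute the command.
echo "$CMD"
```

Logging the assembled command before execution also gives an audit trail of exactly which query was run for a given migration.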

Creating a Deployment Group


Follow these steps to create a deployment group:
1. Launch the Repository Manager client tool and log in to the source repository.
2. Expand the repository, right-click on Deployment Groups and choose New Group.
3. In the dialog window, give the deployment group a name, and choose whether it should be static or dynamic. In this example, we are creating a static deployment group. Click OK.


Adding Objects to a Static Deployment Group


Follow these steps to add objects to a static deployment group:
1. In Designer, Workflow Manager, or Repository Manager, right-click an object that you want to add to the deployment group and choose Versioning -> View History. The View History window appears.
2. In the View History window, right-click the object and choose Add to Deployment Group.

3. In the Deployment Group dialog window, choose the deployment group that you want to add the object to, and click OK.

4. In the final dialog window, choose whether you want to add dependent objects. In most cases, you will want to add dependent objects to the deployment group so that they will be migrated as well. Click OK.


NOTE: The All Dependencies option should be used for any new code that is migrating forward. However, this option can cause issues when moving existing code forward, because All Dependencies also flags shortcuts. During the deployment, PowerCenter tries to re-insert or replace the shortcuts; this does not work and causes the deployment to fail.

The object is added to the deployment group at this point. Although the deployment group allows the most flexibility, adding each object individually requires effort similar to an object copy migration. To make deployment groups easier to use, PowerCenter provides dynamic deployment groups.

Adding Objects to a Dynamic Deployment Group


Dynamic deployment groups are similar in function to static deployment groups, but differ in the way that objects are added. In a static deployment group, objects are manually added one by one. In a dynamic deployment group, the contents of the deployment group are defined by a repository query. Don't worry about the complexity of writing a repository query; it is quite simple and aided by the PowerCenter GUI. Follow these steps to add objects to a dynamic deployment group:
1. Create a deployment group, just as you did for a static deployment group, but choose the dynamic option. Also, select the Queries button.


2. The Query Browser window appears. Choose New to create a query for the dynamic deployment group.

3. In the Query Editor window, provide a name and query type (Shared). Define criteria for the objects that should be migrated. The drop-down list of parameters lets you choose from 23 predefined metadata categories. In this case, the developers have assigned the RELEASE_20050130 label to all objects that need to be migrated, so the query is defined as Label Is Equal To RELEASE_20050130. The creation and application of labels are discussed in Using PowerCenter Labels.


4. Save the Query and exit the Query Editor. Click OK on the Query Browser window, and close the Deployment Group editor window.

Executing a Deployment Group Migration


A deployment group migration can be executed through the Repository Manager client tool, or through the pmrep command line utility. With the client tool, you simply drag the deployment group from the source repository and drop it on the destination repository. This opens the Copy Deployment Group Wizard, which guides you through the step-by-step options for executing the deployment group.

Rolling Back a Deployment


To roll back a deployment, you must first locate the deployment via the TARGET repository's menu bar (i.e., Deployments -> History -> View History -> Rollback).

Automated Deployments
For the optimal migration method, you can set up a UNIX shell or Windows batch script that calls the pmrep DeployDeploymentGroup command, which executes a deployment group migration without human intervention. This is ideal because the script can be scheduled to run overnight, causing minimal impact on developers and the PowerCenter administrator, while retaining the full flexibility of deployment groups. You can also use the pmrep utility to automate importing objects via XML.
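For illustration, such a script might look like the sketch below. The repository, group, user, and file names are hypothetical, and the exact connect/DeployDeploymentGroup flag spellings should be verified against the pmrep reference for your PowerCenter version:

```shell
#!/bin/sh
# Nightly deployment sketch: connect to the source repository, then deploy
# the group to the target. All names and flags here are placeholders.
DEPLOY_GROUP="RELEASE_20050130_DG"
CONTROL_FILE="/opt/infa/deploy/deploy_control.xml"

CONNECT_CMD="pmrep connect -r DEV_REPO -n deploy_user -x deploy_pwd"
DEPLOY_CMD="pmrep DeployDeploymentGroup -p $DEPLOY_GROUP -c $CONTROL_FILE -r TEST_REPO"

# Echoed as a dry run; remove the echoes and schedule the real calls
# via cron or a batch scheduler to run overnight.
echo "$CONNECT_CMD"
echo "$DEPLOY_CMD"
```

Building the commands into variables first makes it easy to log each run and to swap repository names per environment without editing the deployment logic.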

Recommendations
Informatica recommends using the following process when running in a three-tiered environment with development, test, and production servers.

Non-Versioned Repositories
For migrating from development into test, Informatica recommends using the Object Copy method. This method gives you total granular control over the objects that are being moved. It also ensures that the latest development mappings can be moved over manually as they are completed. For recommendations on performing this copy procedure correctly, see the steps listed in the Object Copy section.

Versioned Repositories
For versioned repositories, Informatica recommends using the Deployment Groups method for repository migration in a distributed repository environment. This method provides the greatest flexibility in that you can promote any object from within a development repository (even across folders) into any destination repository. Also, by using labels, dynamic deployment groups, and the enhanced pmrep command line utility, the use of the deployment group migration method results in automated migrations that can be executed without manual intervention.

Third-Party Versioning
Some organizations have standardized on third-party version control software. PowerCenter's XML import/export functionality offers integration with such software and provides a means to migrate objects. This method is most useful in a distributed environment because objects can be exported into an XML file from one repository and imported into the destination repository. The XML object copy process allows you to copy nearly all repository objects, including sources, targets, reusable transformations, mappings, mapplets, workflows, worklets, and tasks. Beginning with PowerCenter 7, the export/import functionality allows multiple objects to be exported to a single XML file, which can significantly cut down on the work associated with object-level XML import/export. The following steps outline the process of exporting the objects from the source repository and importing them into the destination repository:

Exporting
1. From Designer or Workflow Manager, log in to the source repository. Open the folder and highlight the object to be exported.
2. Select Repository -> Export Objects.

3. The system prompts you to select a directory location on the local workstation. Choose the directory in which to save the file. Using the default name for the XML file is generally recommended.
4. Open Windows Explorer and go to the C:\Program Files\Informatica PowerCenter 7.x\Client directory. (This may vary depending on your PowerCenter version and where you installed the client tools.)
5. Find the powrmart.dtd file, make a copy of it, and paste the copy into the directory where you saved the XML file.
6. Together, these files are now ready to be added to the version control software.

Importing
From Designer or the Workflow Manager client tool, log in to the destination repository. Open the folder where the object is to be imported.
1. Select Repository -> Import Objects.
2. The system prompts you to select a directory location and file to import into the repository.
3. The following screen appears with the steps for importing the object.

4. Select the mapping and add it to the Objects to Import list.


5. Click "Next", and then click "Import". Since the shortcuts have been added to the folder, the mapping will now point to the new shortcuts and their parent folder.
6. Note that the pmrep command line utility was greatly enhanced in PowerCenter 7 and later versions, allowing the activities associated with XML import/export to be automated through pmrep.
7. Click on the destination repository service in the left pane and choose Action -> Restore. (Remember, if the destination repository has content, it must be deleted prior to restoring.)

Last updated: 04-Jun-08 16:18


Migration Procedures - PowerExchange

Challenge


To facilitate the migration of PowerExchange definitions from one environment to another.

Description
There are two approaches to perform a migration.
- Using the DTLURDMO utility
- Using the PowerExchange Client tool (Detail Navigator)

DTLURDMO Utility

Step 1: Validate connectivity between the client and listeners

Test communication between clients and all listeners in the production environment with: dtlrexe prog=ping loc=<nodename>.

Run selected jobs to exercise data access through PowerExchange data maps.
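The ping check can be scripted across every listener in one pass; a sketch follows, in which the node names are hypothetical placeholders for the listener nodes defined in your environment:

```shell
#!/bin/sh
# Sketch: build a DTLREXE ping command for each production listener node.
# Node names are placeholders; substitute the nodes from your configuration.
PING_CMDS=""
for NODE in mvs_prod as400_prod nt_prod; do
  PING_CMDS="${PING_CMDS}dtlrexe prog=ping loc=$NODE
"
done

# Dry run: print the commands; execute them directly in a live environment.
printf '%s' "$PING_CMDS"
```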

Step 2: Run DTLURDMO to copy PowerExchange objects.


At this stage, if PowerExchange is to run against new versions of the PowerExchange objects rather than existing libraries, you need to copy the datamaps. To do this, use the PowerExchange Copy Utility DTLURDMO. The following section assumes that the entire datamap set is to be copied. DTLURDMO does have the ability to copy selectively, however, and the full functionality of the utility is documented in the PowerExchange Utilities Guide. The types of definitions that can be managed with this utility are:
- PowerExchange data maps



- PowerExchange capture registrations
- PowerExchange capture extraction data maps

On MVS, the input statements for this utility are taken from SYSIN. On non-MVS platforms, the input argument points to a file containing the input definition. If no input argument is provided, the utility looks for a file dtlurdmo.ini in the current path. The utility runs on all capture platforms.

Windows and UNIX Command Line


Syntax: DTLURDMO <dtlurdmo definition file>

For example: DTLURDMO e:\powerexchange\bin\dtlurdmo.ini

- DTLURDMO definition file specification: This file specifies how the DTLURDMO utility operates. If no definition file is specified, the utility looks for a file dtlurdmo.ini in the current path.

MVS DTLURDMO job utility


Run the utility by submitting the DTLURDMO job, which can be found in the RUNLIB library.
- DTLURDMO definition file specification: This file specifies how the DTLURDMO utility operates and is read from the SYSIN card.

AS/400 utility
Syntax: CALL PGM(<location and name of DTLURDMO executable file>)

For example: CALL PGM(dtllib/DTLURDMO)

- DTLURDMO definition file specification: This file specifies how the DTLURDMO utility operates. By default, the definition is in the member CFG/DTLURDMO in the current datalib library.

If you want to create a separate DTLURDMO definition file rather than use the default location, you must give the library and file name of the definition file as a parameter. For example: CALL PGM(dtllib/DTLURDMO) parm ('datalib/deffile(dtlurdmo)')

Running DTLURDMO
The utility should be run extracting information from the files locally, then writing out the datamaps through the new PowerExchange V8.x.x Listener. This causes the datamaps to be written out in the format required for the upgraded PowerExchange. DTLURDMO must be run once for the datamaps, then again for the registrations, and then the extract maps if this is a capture environment. Commands for mixed datamaps, registrations, and extract maps cannot be run together.
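On Windows or UNIX, the three separate passes described above might be scripted as in the sketch below. The paths and .ini file names are hypothetical; each file would contain the copy statements for one object type (the DM_COPY form is shown in the definition file example later in this section), and the keywords for registrations and extract maps should be taken from the PowerExchange Utilities Guide:

```shell
#!/bin/sh
# Sketch: run DTLURDMO once per object type -- datamaps, registrations,
# then extract maps. Paths and .ini names are hypothetical placeholders.
BIN="/powerexchange/bin"
RUNS=""
for DEF in dtlurdmo_datamaps.ini dtlurdmo_registrations.ini dtlurdmo_extractmaps.ini; do
  RUNS="${RUNS}${BIN}/DTLURDMO ${BIN}/${DEF}
"
done

# Dry run: list the three invocations; execute them directly in practice.
printf '%s' "$RUNS"
```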

If only a subset of the PowerExchange datamaps, registrations, and extract maps are required, then selective copies can be carried out. Details of performing selective copies are documented fully in the PowerExchange Utilities Guide. This document assumes that everything is going to be migrated from the existing environment to the new V8.x.x format.

Definition File Example


The following example shows a definition file to copy all datamaps from the existing local datamaps (the local datamaps are defined in the DATAMAP DD card in the MVS JCL, or by the path on Windows or UNIX) to the V8.x.x listener (defined by the TARGET location node1):

USER DTLUSR;
EPWD A3156A3623298FDC;
SOURCE LOCAL;
TARGET NODE1;
DETAIL;
REPLACE;
DM_COPY;
SELECT schema=*;

Note: The encrypted password (EPWD) is generated from the FILE, ENCRYPT PASSWORD option in the PowerExchange Navigator.

PowerExchange Client Tool (Detail Navigator)

Step 1: Validate connectivity between the client and listeners

Test communication between clients and all listeners in the production environment with: dtlrexe prog=ping loc=<nodename>.


Run selected jobs to exercise data access through PowerExchange data maps.

Step 2: Start the PowerExchange Navigator

- Select the datamap that is going to be promoted to production.
- On the menu bar, select a file to send to the remote node.

- On the drop-down list box, choose the appropriate location (in this case, mvs_prod).
- Supply the user name and password and click OK.


- A confirmation message for successful migration is displayed.


Last updated: 06-Feb-07 11:39


Running Sessions in Recovery Mode

Challenge


Understanding the recovery options that are available for PowerCenter when errors are encountered during the load.

Description
When a task in the workflow fails at any point, one option is to truncate the target and run the workflow again from the beginning. As an alternative, the workflow can be suspended and the error can be fixed, rather than re-processing the portion of the workflow with no errors. This option, "Suspend on Error", results in accurate and complete target data, as if the session completed successfully with one run. There are also recovery options available for workflows and tasks that can be used to handle different failure scenarios.

Configure Mapping for Recovery


For consistent recovery, the mapping needs to produce the same result, in the same order, in the recovery execution as in the failed execution. This can be achieved by sorting the input data using either the sorted ports option in the Source Qualifier (or Application Source Qualifier), or a Sorter transformation with the distinct rows option immediately after the Source Qualifier transformation. Additionally, ensure that all the targets receive data from transformations that produce repeatable data.

Configure Session for Recovery


The recovery strategy can be configured on the Properties page of the Session task. Enable the session for recovery by selecting one of the following three Recovery Strategies:
- Resume from the last checkpoint
  - The Integration Service saves the session recovery information and updates recovery tables for a target database. If a session is interrupted, the Integration Service uses the saved recovery information to recover it.


  - The Integration Service recovers a stopped, aborted, or terminated session from the last checkpoint.
- Restart task
  - The Integration Service does not save session recovery information.
  - If a session is interrupted, the Integration Service reruns the session during recovery.

- Fail task and continue workflow
  - The Integration Service recovers the workflow; it does not recover the session. The session status becomes failed and the Integration Service continues running the workflow.

Configure Workflow for Recovery


The Suspend on Error option directs the Integration Service to suspend the workflow while the error is being fixed and then it resumes the workflow. The workflow is suspended when any of the following tasks fail:
- Session
- Command
- Worklet
- Email

When a task fails in the workflow, the Integration Service stops running tasks in the path. The Integration Service does not evaluate the output link of the failed task. If no other task is running in the workflow, the Workflow Monitor displays the status of the workflow as "Suspended." If one or more tasks are still running in the workflow when a task fails, the Integration Service stops running the failed task and continues running tasks in other paths. The Workflow Monitor displays the status of the workflow as "Suspending."

When the status of the workflow is "Suspended" or "Suspending," you can fix the error, such as a target database error, and recover the workflow in the Workflow Monitor. When you recover a workflow, the Integration Service restarts the failed tasks and continues evaluating the rest of the tasks in the workflow. The Integration Service does not run any task that already completed successfully.

Truncate Target Table


If the truncate table option is enabled in a recovery-enabled session, the target table is not truncated during the recovery process.

Session Logs
In a suspended workflow scenario, the Integration Service uses the existing session log when it resumes the workflow from the point of suspension. However, the earlier runs that caused the suspension are recorded in the historical run information in the repository.

Suspension Email
The workflow can be configured to send an email when the Integration Service suspends the workflow. When a task fails, the workflow is suspended and suspension email is sent. The error can be fixed and the workflow can be resumed subsequently. If another task fails while the Integration Service is suspending the workflow, another suspension email is not sent. The Integration Service only sends out another suspension email if another task fails after the workflow resumes. Check the "Browse Emails" button on the General tab of the Workflow Designer Edit sheet to configure the suspension email.

Suspending Worklets
When the "Suspend On Error" option is enabled for the parent workflow, the Integration Service also suspends the worklet if a task within the worklet fails. When a task in the worklet fails, the Integration Service stops executing the failed task and other tasks in its path. If no other task is running in the worklet, the status of the worklet is "Suspended". If other tasks are still running in the worklet, the status of the worklet is "Suspending". The parent workflow is also suspended when the worklet is "Suspended" or "Suspending".

Starting Recovery
The recovery process can be started using the Workflow Manager or the Workflow Monitor. Alternatively, the recovery process can be started by using pmcmd in command line mode or by using a script.
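As an illustration of the command line route, a recovery call might look like the sketch below. The service, domain, user, folder, and workflow names are hypothetical placeholders, and the recoverworkflow spelling and flags should be checked against the pmcmd reference for your PowerCenter version:

```shell
#!/bin/sh
# Sketch: recover a suspended workflow with pmcmd after the error is fixed.
# All identifiers here are hypothetical placeholders.
RECOVER_CMD="pmcmd recoverworkflow -sv IS_Dev -d Domain_Dev -u admin -p admin_pwd -f PROJECT_X wf_daily_load"

# Dry run: print the command; in practice, execute it directly or wrap it
# in a script invoked by your scheduler.
echo "$RECOVER_CMD"
```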

Recovery Tables and Recovery Process


When the Integration Service runs a session that has a resume recovery strategy, it


writes to recovery tables on the target database system. When the Integration Service recovers the session, it uses information in the recovery tables to determine where to begin loading data to target tables. If you want the Integration Service to create the recovery tables, grant table creation privilege to the database user name that is configured in the target database connection. If you do not want the Integration Service to create the recovery tables, create them manually.

The Integration Service creates the following recovery tables in the target database:

- PM_RECOVERY: Contains target load information for the session run. The Integration Service removes the information from this table after each successful session and initializes the information at the beginning of subsequent sessions.
- PM_TGT_RUN_ID: Contains information that the Integration Service uses to identify each target on the database. The information remains in the table between session runs. If you manually create this table, you must create a row and enter a value other than zero for LAST_TGT_RUN_ID to ensure that the session recovers successfully.
- PM_REC_STATE: Created when the Integration Service runs a real-time session that uses the recovery table and has recovery enabled; it stores message IDs and commit numbers. When the Integration Service recovers the session, it uses this information to determine whether it needs to write a message to the target table.

If you edit or drop the recovery tables before you recover a session, the Integration Service cannot recover the session. If you disable recovery, the Integration Service does not remove the recovery tables from the target database; you must remove them manually.

Session Recovery Considerations


The following options affect whether the session is incrementally recoverable:
- Output is deterministic. A property that determines if the transformation generates the same set of data for each session run.
- Output is repeatable. A property that determines if the transformation generates the data in the same order for each session run. You can set this property for Custom transformations.
- Lookup source is static. A Lookup transformation property that determines if the lookup source is the same between the session and recovery. The


Integration Service uses this property to determine if the output is deterministic.

Inconsistent Data During Recovery Process


For recovery to be effective, the recovery session must produce the same set of rows; and in the same order. Any change after initial failure (in mapping, session and/or in the Integration Service) that changes the ability to produce repeatable data, results in inconsistent data during the recovery process. The following situations may produce inconsistent data during a recovery session:
- Session performs incremental aggregation and the Integration Service stops unexpectedly.
- Mapping uses a Sequence Generator transformation.
- Mapping uses a Normalizer transformation.
- Source and/or target changes after the initial session failure.
- Data movement mode changes after the initial session failure.
- Code page (server, source, or target) changes after the initial session failure.
- Mapping changes in a way that causes the server to distribute, filter, or aggregate rows differently.
- Session configurations are not supported by PowerCenter for session recovery.
- Mapping uses a lookup table and the data in the lookup table changes between session runs.
- Session sort order changes when the server is running in Unicode mode.

HA Recovery
Highly-available recovery allows the workflow to resume automatically in the case of Integration Service failover. The following options are available in the properties tab of the workflow:
- Enable HA recovery: Allows the workflow to be configured for high availability.
- Automatically recover terminated tasks: Recovers terminated Session or Command tasks without user intervention.
- Maximum automatic recovery attempts: When you automatically recover terminated tasks, you can choose the number of times the Integration Service

attempts to recover the task. The default setting is 5.


Last updated: 26-May-08 11:28


Using PowerCenter Labels

Challenge


Using labels effectively in a data warehouse or data integration project to assist with administration and migration.

Description
A label is a versioning object that can be associated with any versioned object or group of versioned objects in a repository. Labels provide a way to tag a number of object versions with a name for later identification. Therefore, a label is a named object in the repository, whose purpose is to be a pointer or reference to a group of versioned objects. For example, a label called Project X version X can be applied to all object versions that are part of that project and release. Labels can be used for many purposes:
- Track versioned objects during development.
- Improve object query results.
- Create logical groups of objects for future deployment.
- Associate groups of objects for import and export.

Note that labels apply to individual object versions, not to objects as a whole. So if a mapping has ten versions checked in, and a label is applied to version 9, then only version 9 has that label. The other versions of that mapping do not automatically inherit it. However, multiple labels can point to the same object for greater flexibility. The Use Repository Manager privilege is required to create or edit labels. To create a label, choose Versioning >> Labels from the Repository Manager.


When creating a new label, choose a name that is as descriptive as possible. For example, a suggested naming convention for labels is: Project_Version_Action. Include comments for further meaningful description. Locking the label is also advisable. This prevents anyone from accidentally associating additional objects with the label or removing object references for the label. Labels, like other global objects such as Queries and Deployment Groups, can have user and group privileges attached to them. This allows an administrator to create a label that can only be used by specific individuals or groups. Only those people working on a specific project should be given read/write/execute permissions for labels that are assigned to that project.


Once a label is created, it should be applied to related objects. To apply the label to objects, invoke the Apply Label wizard from the Versioning >> Apply Label menu option from the menu bar in the Repository Manager (as shown in the following figure).

Applying Labels
Labels can be applied to any object and cascaded upwards and downwards to parent and/or child objects. For example, to group dependencies for a workflow, apply a label to all children objects. The Repository Server applies labels to sources, targets, mappings, and tasks associated with the workflow. Use the Move label property to point the label to the latest version of the object(s). Note: Labels can be applied to any object version in the repository except checked-out versions. Execute permission is required for applying labels. After the label has been applied to related objects, it can be used in queries and deployment groups (see the Best Practice on Deployment Groups). Labels can also be used to manage the size of the repository (i.e., to purge object versions).

Using Labels in Deployment


An object query can be created using the existing labels (as shown below). Labels can be associated only with a dynamic deployment group. Based on the object query, objects associated with that label can be used in the deployment.


Strategies for Labels


Repository Administrators and other individuals in charge of migrations should develop their own label strategies and naming conventions in the early stages of a data integration project. Be sure that developers are aware of the uses of these labels and when they should apply labels. For each planned migration between repositories, choose three labels for the development and subsequent repositories:
- The first is to identify the objects that developers can mark as ready for migration.
- The second should apply to migrated objects, thus developing a migration audit trail.
- The third is to apply to objects as they are migrated into the receiving repository, completing the migration audit trail.


When preparing for the migration, use the first label to construct a query to build a dynamic deployment group. The second and third labels in the process are optionally applied by the migration wizard when copying folders between versioned repositories. Developers and administrators do not need to apply the second and third labels manually. Additional labels can be created with developers to allow the progress of mappings to be tracked if desired. For example, when an object is successfully unit-tested by the developer, it can be marked as such. Developers can also label the object with a migration label at a later time if necessary. Using labels in this fashion along with the query feature allows complete or incomplete objects to be identified quickly and easily, thereby providing an object-based view of progress.

Last updated: 04-Jun-08 13:47


Deploying Data Analyzer Objects

Challenge


To understand the methods for deploying Data Analyzer objects among repositories and the limitations of such deployment.

Description
Data Analyzer repository objects can be exported to and imported from Extensible Markup Language (XML) files. Export/import facilitates archiving the Data Analyzer repository and deploying Data Analyzer Dashboards and reports from development to production. The following repository objects in Data Analyzer can be exported and imported:
- Schemas
- Reports
- Time Dimensions
- Global Variables
- Dashboards
- Security profiles
- Schedules
- Users
- Groups
- Roles

The XML file created by exporting objects should not be modified. Any change might invalidate the XML file and cause the import of objects into a Data Analyzer repository to fail. For more information on exporting objects from the Data Analyzer repository, refer to the Data Analyzer Administration Guide.
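Because a hand-edited export file can fail to import, it is worth checking a file before moving it between environments. The sketch below checks only well-formedness using Python's standard library (full DTD validation requires a validating parser, which the stdlib does not provide, and is better left to the import process itself); the sample XML fragments are invented:

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the XML parses; a malformed export fails fast here."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

good = "<report><name>Sales Summary</name></report>"  # invented sample
bad  = "<report><name>Sales Summary</report>"         # mismatched tag
print(is_well_formed(good), is_well_formed(bad))      # True False
```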

Exporting Schema(s)


To export the definition of a star schema or an operational schema, select a metric or folder from the Metrics system folder in the Schema Directory. When you export a folder, you export the schema associated with the definitions of the metrics in that folder and its subfolders. If the folder you select for export does not contain any objects, Data Analyzer does not export any schema definition and displays the message: "There is no content to be exported."

There are two ways to export metrics or folders containing metrics:
- Select the Export Metric Definitions and All Associated Schema Table and Attribute Definitions option. If you export a metric and its associated schema objects, Data Analyzer exports the definitions of the metric and the schema objects associated with that metric. If you export an entire metric folder and its associated objects, Data Analyzer exports the definitions of all metrics in the folder, as well as the schema objects associated with every metric in the folder.
- Alternatively, select the Export Metric Definitions Only option. When you export only the definition of the selected metric, Data Analyzer does not export the definition of the schema table from which the metric is derived or any other associated schema object.

1. Login to Data Analyzer as a System Administrator.
2. Click the Administration tab > XML Export/Import > Export Schemas.
3. All the metric folders in the schema directory are displayed. Click Refresh Schema to display the latest list of folders and metrics in the schema directory.
4. Select the check box for the folder or metric to be exported and click Export as XML.
5. Enter the XML filename and click Save to save the XML file.
6. The XML file will be stored locally on the client machine.

Exporting Report(s)
To export the definitions of more than one report, select multiple reports or folders. Data Analyzer exports only report definitions. It does not export the data or the schedule for cached reports. As part of the Report Definition export, Data Analyzer exports the report table, report chart, filters, indicators (i.e., gauge, chart, and table indicators), custom metrics, links to similar reports, and all reports in an analytic workflow, including links to similar reports.


Reports can have public or personal indicators associated with them. By default, Data Analyzer exports only the public indicators associated with a report. To export the personal indicators as well, select the Export Personal Indicators check box.

To export an analytic workflow, you need to export only the originating report. When you export the originating report of an analytic workflow, Data Analyzer exports the definitions of all the workflow reports. If a report in the analytic workflow has similar reports associated with it, Data Analyzer exports the links to the similar reports.

Data Analyzer does not export the alerts, schedules, or global variables associated with the report. Although Data Analyzer does not export global variables, it lists all global variables it finds in the report filter. You can, however, export these global variables separately.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Export Reports.
3. Select the folder or report to be exported.
4. Click Export as XML.
5. Enter the XML filename and click Save to save the XML file.
6. The XML file will be stored locally on the client machine.

Exporting Global Variables


1. Login to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Export Global Variables.
3. Select the global variable to be exported.
4. Click Export as XML.
5. Enter the XML filename and click Save to save the XML file.
6. The XML file will be stored locally on the client machine.

Exporting a Dashboard
Whenever a dashboard is exported, Data Analyzer exports the reports, indicators, shared documents, and gauges associated with the dashboard. Data Analyzer does not, however, export the alerts, access permissions, attributes or metrics in the report(s), or real-time objects. You can export any of the public dashboards defined in the repository, and can export more than one dashboard at a time.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Export Dashboards.
3. Select the dashboard to be exported.


4. Click Export as XML.
5. Enter the XML filename and click Save to save the XML file.
6. The XML file will be stored locally on the client machine.

Exporting a User Security Profile


Data Analyzer maintains a security profile for each user or group in the repository. A security profile consists of the access permissions and data restrictions that the system administrator sets for a user or group.

When exporting a security profile, Data Analyzer exports access permissions for objects under the Schema Directory, which include folders, metrics, and attributes. Data Analyzer does not export access permissions for filtersets, reports, or shared documents. Data Analyzer allows you to export only one security profile at a time. If the user or group security profile you export does not have any access permissions or data restrictions, Data Analyzer does not export any object definitions and displays the message: "There is no content to be exported."

1. Login to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Export Security Profile.
3. Click Export from Users and select the user whose security profile is to be exported.
4. Click Export as XML.
5. Enter the XML filename and click Save to save the XML file.
6. The XML file will be stored locally on the client machine.

Exporting a Schedule
You can export a time-based or event-based schedule to an XML file. Data Analyzer runs a report with a time-based schedule on a configured schedule, and runs a report with an event-based schedule when a PowerCenter session completes. When you export a schedule, Data Analyzer does not export the history of the schedule.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Export Schedules.
3. Select the schedule to be exported.
4. Click Export as XML.

5. Enter the XML filename and click Save to save the XML file.
6. The XML file will be stored locally on the client machine.

Exporting Users, Groups, or Roles

Exporting Users


You can export the definition of any user defined in the repository. However, you cannot export the definitions of system users defined by Data Analyzer. If you have more than one thousand users defined in the repository, Data Analyzer allows you to search for the users that you want to export. You can use the asterisk (*) or the percent symbol (%) as wildcard characters to search for users to export. You can export the definitions of more than one user, including the following information:
- Login name
- Description
- First, middle, and last name
- Title
- Password
- Change password privilege
- Password never expires indicator
- Account status
- Groups to which the user belongs
- Roles assigned to the user
- Query governing settings

Data Analyzer does not export the email address, reply-to address, department, or color scheme assignment associated with the exported user(s).

1. Login to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Export User/Group/Role.
3. Click Export Users/Group(s)/Role(s).
4. Select the user(s) to be exported.
5. Click Export as XML.
6. Enter the XML filename and click Save to save the XML file.
7. The XML file will be stored locally on the client machine.
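The user search supports both the asterisk (*) and the percent symbol (%) as wildcards. A small helper like the one below (hypothetical, not part of Data Analyzer) shows how both characters can be treated as "match any sequence" by translating the pattern to a regular expression:

```python
import re

def wildcard_match(pattern, name):
    """Treat both * and % as 'match any sequence of characters'."""
    regex = "".join(".*" if ch in "*%" else re.escape(ch) for ch in pattern)
    return re.fullmatch(regex, name) is not None

# Invented user names for illustration.
print(wildcard_match("j*smith", "jane.smith"))    # True
print(wildcard_match("%admin%", "sys_admin_01"))  # True
print(wildcard_match("j%", "kwilson"))            # False
```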


Exporting Groups
You can export any group defined in the repository, and can export the definitions of multiple groups. You can also export the definitions of all the users within a selected group. Use the asterisk (*) or percent symbol (%) as wildcard characters to search for groups to export. Each group definition includes the following information:
- Name
- Description
- Department
- Color scheme assignment
- Group hierarchy
- Roles assigned to the group
- Users assigned to the group
- Query governing settings

Data Analyzer does not export the color scheme associated with an exported group.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Export User/Group/Role.
3. Click Export Users/Group(s)/Role(s).
4. Select the group to be exported.
5. Click Export as XML.
6. Enter the XML filename and click Save to save the XML file.
7. The XML file will be stored locally on the client machine.

Exporting Roles
You can export the definitions of the custom roles defined in the repository. However, you cannot export the definitions of system roles defined by Data Analyzer. You can export the definitions of more than one role. Each role definition includes the name and description of the role and the permissions assigned to it.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Export User/Group/Role.
3. Click Export Users/Group(s)/Role(s).
4. Select the role to be exported.
5. Click Export as XML.


6. Enter the XML filename and click Save to save the XML file.
7. The XML file will be stored locally on the client machine.

Importing Objects
You can import objects into the same repository or a different repository. If you import objects that already exist in the repository, you can choose to overwrite the existing objects. However, you can import only global variables that do not already exist in the repository.

When you import objects, you can validate the XML file against the DTD provided by Data Analyzer. Informatica recommends that you do not modify the XML files after you export them from Data Analyzer. Ordinarily, you do not need to validate an XML file that you create by exporting from Data Analyzer. However, if you are not sure of the validity of an XML file, you can validate it against the Data Analyzer DTD file when you start the import process.

To import repository objects, you must have the System Administrator role or the Access XML Export/Import privilege. When you import a repository object, you become the owner of the object as if you created it. However, other system administrators can also access imported repository objects.

You can limit access to reports for users who are not system administrators. If you choose to publish imported reports to everyone, all users in Data Analyzer have read and write access to them. You can change the access permissions to reports after you import them.

Importing Schemas
When importing schemas, if the XML file contains only the metric definition, you must make sure that the fact table for the metric exists in the target repository. You can import a metric only if its associated fact table exists in the target repository or the definition of its associated fact table is also in the XML file.

When you import a schema, Data Analyzer displays a list of all the definitions contained in the XML file. It then displays a list of all the object definitions in the XML file that already exist in the repository. You can choose to overwrite objects in the repository. If you import a schema that contains time keys, you must import or create a time dimension.

1. Login to Data Analyzer as a System Administrator.


2. Click Administration > XML Export/Import > Import Schema.
3. Click Browse to choose an XML file to import.
4. Select Validate XML against DTD.
5. Click Import XML.
6. Verify all attributes on the summary page, and choose Continue.

Importing Reports
A valid XML file of exported report objects can contain definitions of cached or on-demand reports, including prompted reports. When you import a report, you must make sure that all the metrics and attributes used in the report are defined in the target repository. If you import a report that contains attributes and metrics not defined in the target repository, you can cancel the import process. If you choose to continue the import process, you may not be able to run the report correctly. To run the report, you must import or add the attribute and metric definitions to the target repository.

You are the owner of all the reports you import, including the personal or public indicators associated with the reports. You can publish the imported reports to all Data Analyzer users. If you publish reports to everyone, Data Analyzer provides read access to the reports to all users. However, it does not provide access to the folder that contains the imported reports. If you want another user to access an imported report, you can put the imported report in a public folder and have the user save or move it to his or her personal folder. Any public indicator associated with the report also becomes accessible to the user.

If you import a report and its corresponding analytic workflow, the XML file contains all workflow reports. If you choose to overwrite the report, Data Analyzer also overwrites the workflow reports. Also, when importing multiple workflows, note that Data Analyzer does not import analytic workflows containing the same workflow report names. Ensure that all imported analytic workflows have unique report names prior to being imported.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Import Report.
3. Click Browse to choose an XML file to import.
4. Select Validate XML against DTD.
5. Click Import XML.
6. Verify all attributes on the summary page, and choose Continue.

Importing Global Variables


You can import global variables that are not defined in the target repository. If the XML file contains global variables already in the repository, you can cancel the process. If you continue the import process, Data Analyzer imports only the global variables not already in the target repository.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Import Global Variables.
3. Click Browse to choose an XML file to import.
4. Select Validate XML against DTD.
5. Click Import XML.
6. Verify all attributes on the summary page, and choose Continue.

Importing Dashboards
Dashboards display links to reports, shared documents, alerts, and indicators. When you import a dashboard, Data Analyzer imports the following objects associated with the dashboard:
- Reports
- Indicators
- Shared documents
- Gauges

Data Analyzer does not import the following objects associated with the dashboard:
- Alerts
- Access permissions
- Attributes and metrics in the report
- Real-time objects

If an object already exists in the repository, Data Analyzer provides an option to overwrite it. Data Analyzer does not import the attributes and metrics in the reports associated with the dashboard. If the attributes or metrics in a report associated with the dashboard do not exist, the report does not display on the imported dashboard.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Import Dashboard.
3. Click Browse to choose an XML file to import.
4. Select Validate XML against DTD.


5. Click Import XML.
6. Verify all attributes on the summary page, and choose Continue.

Importing Security Profile(s)


To import a security profile, begin by selecting the user or group to which you want to assign the security profile. You can assign the same security profile to more than one user or group.

When you import a security profile and associate it with a user or group, you can either overwrite the current security profile or add to it. When you overwrite a security profile, you assign the user or group only the access permissions and data restrictions found in the new security profile; Data Analyzer removes the old restrictions associated with the user or group. When you append a security profile, you assign the user or group the new access permissions and data restrictions in addition to the old permissions and restrictions.

When exporting a security profile, Data Analyzer exports the security profile for objects in the Schema Directory, including folders, attributes, and metrics. However, it does not include the security profile for filtersets.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Import Security Profile.
3. Click Import to Users.
4. Select the user with which you want to associate the imported security profile.
   - To associate the imported security profiles with all the users on the page, select the Users check box at the top of the list.
   - To associate the imported security profiles with all the users in the repository, select Import to All.
   - To overwrite the selected user's current security profile with the imported security profile, select Overwrite.
   - To append the imported security profile to the selected user's current security profile, select Append.

5. Click Browse to choose an XML file to import.
6. Select Validate XML against DTD.
7. Click Import XML.
8. Verify all attributes on the summary page, and choose Continue.


Importing Schedule(s)
A time-based schedule runs reports based on a configured schedule. An event-based schedule runs reports when a PowerCenter session completes. You can import a time-based or event-based schedule from an XML file. When you import a schedule, Data Analyzer does not attach the schedule to any reports.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Import Schedule.
3. Click Browse to choose an XML file to import.
4. Select Validate XML against DTD.
5. Click Import XML.
6. Verify all attributes on the summary page, and choose Continue.

Importing Users, Groups, or Roles


When you import a user, group, or role, you import all the information associated with each user, group, or role. The XML file includes definitions of roles assigned to users or groups, and definitions of users within groups. For this reason, you can import the definition of a user, group, or role in the same import process. When importing a user, you import the definitions of roles assigned to the user and the groups to which the user belongs.

When you import a user or group, you import the user or group definitions only. The XML file does not contain the color scheme assignments, access permissions, or data restrictions for the user or group. To import the access permissions and data restrictions, you must import the security profile for the user or group.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration > XML Export/Import > Import User/Group/Role.
3. Click Browse to choose an XML file to import.
4. Select Validate XML against DTD.
5. Click Import XML.
6. Verify all attributes on the summary page, and choose Continue.

Tips for Importing/Exporting


- Schedule importing/exporting of repository objects for a time of minimal Data Analyzer activity, when most users are not accessing the Data Analyzer repository. This helps prevent users from experiencing timeout errors or degraded response time. Only the System Administrator should perform import/export operations.
- Back up the Data Analyzer repository prior to performing an import/export operation. This backup should be completed using the Repository Backup Utility provided with Data Analyzer.
- Manually add user/group permissions for the report. These permissions are not exported as part of exporting reports and should be added manually after the report is imported on the target server.
- Use a version-control tool. Prior to importing objects into a new environment, it is advisable to check the XML documents into a version-control tool such as Microsoft Visual SourceSafe or PVCS. This facilitates the versioning of repository objects and provides a means for rollback to a prior version of an object, if necessary.
- Attach cached reports to schedules. Data Analyzer does not import the schedule with a cached report. When you import cached reports, you must attach them to schedules in the target repository. You can attach multiple imported reports to schedules in the target repository in one process immediately after you import them.
- Ensure that global variables exist in the target repository. If you import a report that uses global variables in the attribute filter, ensure that the global variables already exist in the target repository. If they are not in the target repository, you must either import the global variables from the source repository or recreate them in the target repository.
- Manually add indicators to the dashboard. When you import a dashboard, Data Analyzer imports all indicators for the originating report and workflow reports in a workflow. However, indicators for workflow reports do not display on the dashboard until added manually.
- Check with your System Administrator to understand what level of LDAP integration has been configured (if any). Users, groups, and roles need to be exported and imported during deployment when using repository authentication. If Data Analyzer has been integrated with an LDAP (Lightweight Directory Access Protocol) tool, then users, groups, and/or roles may not require deployment.

When you import users into a Microsoft SQL Server or IBM DB2 repository, Data Analyzer blocks all user authentication requests until the import process is complete.


Installing Data Analyzer

Challenge


Installing Data Analyzer on new or existing hardware, either as a dedicated application on a physical machine (as Informatica recommends) or co-existing with other applications on the same physical server or with other Web applications on the same application server.

Description
Consider the following questions when determining what type of hardware to use for Data Analyzer.

If the hardware already exists:

1. Is the processor, operating system, and database software supported by Data Analyzer?
2. Are the necessary operating system and database patches applied?
3. How many CPUs does the machine currently have? Can the CPU capacity be expanded?
4. How much memory does the machine have? How much is available to the Data Analyzer application?
5. Will Data Analyzer share the machine with other applications? If yes, what are the CPU and memory requirements of the other applications?

If the hardware does not already exist:

1. Has the organization standardized on a hardware or operating system vendor?
2. What type of operating system is preferred and supported (e.g., Solaris, Windows, AIX, HP-UX, Red Hat AS, SuSE)?
3. What database and version is preferred and supported for the Data Analyzer repository?

Regardless of the hardware vendor chosen, the hardware must be configured and sized appropriately to support the reporting response time requirements for Data Analyzer. The following questions should be answered in order to estimate the size of a Data Analyzer server:

1. How many users are predicted for concurrent access?
2. On average, how many rows will be returned in each report?
3. On average, how many charts will there be for each report?
4. Do the business requirements mandate an SSL Web server?

The hardware requirements for the Data Analyzer environment depend on the number of concurrent users, types of reports being used (i.e., interactive vs. static), average number of records in a report, application server and operating system used, among other factors. The following table should be used as a general guide for hardware recommendations for a Data Analyzer installation. Actual results may vary depending upon exact hardware configuration and user volume. For exact sizing recommendations, contact Informatica Professional Services for a Data Analyzer Sizing and Baseline Architecture engagement.

Windows
| # of Concurrent Users | Average Number of Rows per Report | Average # of Charts per Report | Estimated # of CPUs for Peak Usage | Estimated Total RAM (For Data Analyzer alone) | Estimated # of App Servers in a Clustered Environment |
|---|---|---|---|---|---|
| 50  | 1000  |    |     | 1 GB   | 1   |
| 100 | 1000  | 2  | 3   | 2 GB   | 1-2 |
| 200 | 1000  | 2  | 6   | 3.5 GB | 3   |
| 400 | 1000  | 2  | 12  | 6.5 GB | 6   |
| 100 | 1000  | 2  | 3   | 2 GB   | 1-2 |
| 100 | 2000  | 2  | 3   | 2.5 GB | 1-2 |
| 100 | 5000  | 2  | 4   | 3 GB   | 2   |
| 100 | 10000 | 2  | 5   | 4 GB   | 2-3 |
| 100 | 1000  | 2  | 3   | 2 GB   | 1-2 |
| 100 | 1000  | 5  | 3   | 2 GB   | 1-2 |
| 100 | 1000  | 7  | 3   | 2.5 GB | 1-2 |
| 100 | 1000  | 10 | 3-4 | 3 GB   | 1-2 |

Notes:
1. This estimating guide is based on experiments conducted in the Informatica lab.
2. The sizing estimates are based on PowerAnalyzer 5 running BEA WebLogic 8.1 SP3 and Windows 2000 on a 4-CPU 2.5 GHz Xeon processor. This estimate may not be accurate for other, different environments.
3. The number of concurrent users under peak volume can be estimated by multiplying the number of total users by the percentage of concurrent users. In practice, typically 10 percent of the user base is concurrent. However, this percentage can be as high as 50 percent or as low as 5 percent in some organizations.
4. For every two CPUs on the server, Informatica recommends one managed server (instance) of the application server. For servers with at least four CPUs, clustering multiple logical instances of the application server on one physical server can result in increased performance.
5. There will be an increase in overhead for an SSL Web server architecture, depending on the strength of encryption.
6. CPU utilization can be reduced by 10 to 25 percent by using SVG charts, otherwise known as interactive charting, rather than the default PNG charting.
7. Clustering is recommended for instances with more than 50 concurrent users. (Clustering doesn't have to be across multiple boxes if the server has >= 4 CPUs.)
8. Informatica Professional Services should be engaged for a thorough and accurate sizing estimate.
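The rule of thumb in note 3 (concurrent users = total users x concurrency percentage) is easy to turn into a quick estimator. The function below is illustrative only, and the user counts are invented rather than taken from the sizing table:

```python
def estimate_concurrent_users(total_users, concurrency_pct=0.10):
    """Rule of thumb: typically ~10% of the user base is concurrent at
    peak, but anywhere from 5% to 50% in practice."""
    return round(total_users * concurrency_pct)

# Invented figures, not from the sizing table above.
print(estimate_concurrent_users(1000))        # typical 10% -> 100
print(estimate_concurrent_users(1000, 0.05))  # low end     -> 50
print(estimate_concurrent_users(1000, 0.50))  # high end    -> 500
```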

IBM AIX
| # of Concurrent Users | Average Number of Rows per Report | Average # of Charts per Report | Estimated # of CPUs for Peak Usage | Estimated Total RAM (For Data Analyzer alone) | Estimated # of App Servers in a Clustered Environment |
|---|---|---|---|---|---|
| 50  | 1000  |    |      | 1 GB   | 1   |
| 100 | 1000  | 2  | 2-3  | 2 GB   | 1   |
| 200 | 1000  | 2  | 4-5  | 3.5 GB | 2-3 |
| 400 | 1000  | 2  | 9-10 | 6 GB   | 4-5 |
| 100 | 1000  | 2  | 2-3  | 2 GB   | 1   |
| 100 | 2000  | 2  | 2-3  | 2 GB   | 1-2 |
| 100 | 5000  | 2  | 2-3  | 3 GB   | 1-2 |
| 100 | 10000 | 2  | 4    | 4 GB   | 2   |
| 100 | 1000  | 2  | 2-3  | 2 GB   | 1   |
| 100 | 1000  | 5  | 2-3  | 2 GB   | 1   |
| 100 | 1000  | 7  | 2-3  | 2 GB   | 1-2 |
| 100 | 1000  | 10 | 2-3  | 2.5 GB | 1-2 |

Notes:
1. This estimating guide is based on experiments conducted in the Informatica lab.
2. The sizing estimates are based on PowerAnalyzer 5 running IBM WebSphere 5.1.1.1 and AIX 5.2.02 on a 4-CPU 2.4 GHz IBM p630. This estimate may not be accurate for other, different environments.
3. The number of concurrent users under peak volume can be estimated by multiplying the number of total users by the percentage of concurrent users. In practice, typically 10 percent of the user base is concurrent. However, this percentage can be as high as 50 percent or as low as 5 percent in some organizations.
4. For every two CPUs on the server, Informatica recommends one managed server (instance) of the application server. For servers with at least four CPUs, clustering multiple logical instances of the application server on one physical server can result in increased performance.
5. Add 30 to 50 percent overhead for an SSL Web server architecture, depending on the strength of encryption.
6. CPU utilization can be reduced by 10 to 25 percent by using SVG charts, otherwise known as interactive charting, rather than the default PNG charting.
7. Clustering is recommended for instances with more than 50 concurrent users. (Clustering doesn't have to be across multiple boxes if the server has >= 4 CPUs.)
8. Informatica Professional Services should be engaged for a thorough and accurate sizing estimate.

Data Analyzer Installation


The Data Analyzer installation process involves two main components: the Data Analyzer Repository and the Data Analyzer Server, which is an application deployed on an application server. A Web server is necessary to support these components and is included with the installation of the application servers. This section discusses the installation process for JBOSS, BEA WebLogic and IBM WebSphere. The installation tips apply to both Windows and UNIX environments. This section is intended to serve as a supplement to the Data Analyzer Installation Guide. Before installing Data Analyzer, be sure to complete the following steps:
- Verify that the hardware meets the minimum system requirements for Data Analyzer.
- Ensure that the combination of hardware, operating system, application server, repository database, and, optionally, authentication software is supported by Data Analyzer.
- Ensure that sufficient space has been allocated to the Data Analyzer repository.
- Apply all necessary patches to the operating system and database software.
- Verify connectivity to the data warehouse database (or other reporting source) and repository database.
- If LDAP or NT Domain is used for Data Analyzer authentication, verify connectivity to the LDAP directory server or the NT primary domain controller.
- Obtain the Data Analyzer license file from technical support.
- On UNIX/Linux installations, ensure that the OS user running Data Analyzer has execute privileges on all Data Analyzer installation executables.

In addition to the standard Data Analyzer components that are installed by default, you can also install Metadata Manager. With Version 8.0, the Data Analyzer SDK and Portal Integration Kit are now installed with Data Analyzer. Refer to the Data Analyzer documentation for detailed information for these components.

Changes to Installation Process


Beginning with Data Analyzer version 7.1.4, Data Analyzer is packaged with PowerCenter Advanced Edition. To install only the Data Analyzer portion, choose the Custom Installation option during the installation process. On the following screen, uncheck all of the check boxes except the Data Analyzer check box and then click Next.

Repository Configuration
To properly install Data Analyzer you need to have connectivity information for the database server where the repository is going to reside. This information includes:
- Database URL
- Repository username
- Password for the repository username
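Before running the installer, it can help to sanity-check the JDBC URL you plan to supply. The sketch below parses an Oracle thin-driver style URL (jdbc:oracle:thin:@host:port:SID); other drivers use different URL formats, and the host and SID values shown are invented:

```python
import re

def parse_oracle_thin_url(url):
    """Extract host, port, and SID from a jdbc:oracle:thin:@host:port:SID URL."""
    match = re.fullmatch(r"jdbc:oracle:thin:@([^:]+):(\d+):(\w+)", url)
    if not match:
        raise ValueError(f"not an Oracle thin JDBC URL: {url}")
    host, port, sid = match.groups()
    return {"host": host, "port": int(port), "sid": sid}

# Invented example values.
print(parse_oracle_thin_url("jdbc:oracle:thin:@dbhost01:1521:DAREP"))
```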

Installation Steps: JBOSS



The following are the basic installation steps for Data Analyzer on JBOSS:

1. Set up the Data Analyzer repository database. The Data Analyzer Server installation process will create the repository tables, but an empty database schema needs to exist and be able to be connected to via JDBC prior to installation.
2. Install Data Analyzer. The Data Analyzer installation process will install JBOSS if a version does not already exist, or an existing instance can be selected.
3. Apply the Data Analyzer license key.
4. Install the Data Analyzer Online Help.

Installation Tips: JBOSS


The following are the basic installation tips for Data Analyzer on JBOSS:

- Beginning with PowerAnalyzer 5, multiple Data Analyzer instances can be installed on a single instance of JBOSS. Other applications can also coexist with Data Analyzer on a single JBOSS instance. Although this architecture should be considered during hardware sizing estimates, it allows greater flexibility during installation.
- For JBOSS installations on UNIX, the JBOSS Server installation program requires an X-Windows server. If JBOSS Server is installed on a machine where an X-Windows server is not installed, an X-Windows server must be installed on another machine in order to render graphics for the GUI-based installation program. For more information on installing on UNIX, see the UNIX Servers section of the installation and configuration tips below.
- If the Data Analyzer installation files are transferred to the Data Analyzer Server, they must be FTP'd in binary format.
- To enable an installation error log, read the Knowledgebase article "HOW TO: Debug PowerAnalyzer Installations," available through My Informatica (http://my.informatica.com).
- During the Data Analyzer installation process, you are prompted to choose an authentication method for Data Analyzer, such as repository, NT Domain, or LDAP. If LDAP or NT Domain authentication is used, have the configuration parameters available during installation, as the installer configures all properties files at installation.
- The Data Analyzer license file must be applied prior to starting Data Analyzer.

Configuration Screen


Installation Steps: BEA WebLogic


The following are the basic installation steps for Data Analyzer on BEA WebLogic:

1. Set up the Data Analyzer repository database. The Data Analyzer Server installation process creates the repository tables, but an empty database schema must exist and be reachable via JDBC prior to installation.
2. Install BEA WebLogic and apply the BEA license.
3. Install Data Analyzer.
4. Apply the Data Analyzer license key.
5. Install the Data Analyzer Online Help.

TIP: When creating a repository in an Oracle database, make sure the storage parameters specified for the tablespace that contains the repository are not set too large. Since many target tablespaces are initially set with very large INITIAL and NEXT values, large storage parameters cause the repository to use excessive amounts of space. Also verify that the default tablespace for the user that owns the repository tables is set correctly. The following example shows how to set the recommended storage parameters, assuming the repository is stored in the REPOSITORY tablespace:

ALTER TABLESPACE REPOSITORY
DEFAULT STORAGE (
    INITIAL 10K
    NEXT 10K
    MAXEXTENTS UNLIMITED
    PCTINCREASE 50
);

Installation Tips: BEA WebLogic



The following are the basic installation tips for Data Analyzer on BEA WebLogic:

- Beginning with PowerAnalyzer 5, multiple Data Analyzer instances can be installed on a single instance of WebLogic. Other applications can also coexist with Data Analyzer on a single WebLogic instance. Although this architecture should be factored into hardware sizing estimates, it allows greater flexibility during installation.
- With Data Analyzer 8, there is a console version of the installation available. X-Windows is no longer required for WebLogic installations.
- If the Data Analyzer installation files are transferred to the Data Analyzer Server, they must be FTP'd in binary format.
- To enable an installation error log, read the Knowledgebase article "HOW TO: Debug PowerAnalyzer Installations," available through My Informatica (http://my.informatica.com).
- During the Data Analyzer installation process, you are prompted to choose an authentication method for Data Analyzer, such as repository, NT Domain, or LDAP. If LDAP or NT Domain authentication is used, have the configuration parameters available during installation, since the installer configures all properties files at installation.
- The Data Analyzer license file and BEA WebLogic license must be applied prior to starting Data Analyzer.

Configuration Screen

Installation Steps: IBM WebSphere


The following are the basic installation steps for Data Analyzer on IBM WebSphere:

1. Set up the Data Analyzer repository database. The Data Analyzer Server installation process creates the repository tables, but the empty database schema needs to exist and be reachable via JDBC prior to installation.
2. Install IBM WebSphere and apply the WebSphere patches. WebSphere can be installed in its Base configuration, or in its Network Deployment configuration if clustering will be utilized. In both cases, patchsets need to be applied.
3. Install Data Analyzer.
4. Apply the Data Analyzer license key.
5. Install the Data Analyzer Online Help.
6. Configure the PowerCenter Integration Utility. See the section "Configuring the PowerCenter Integration Utility for WebSphere" in the PowerCenter Installation and Configuration Guide.

Installation Tips: IBM WebSphere


The following are the basic installation tips for Data Analyzer on IBM WebSphere:

- Starting in Data Analyzer 5, multiple Data Analyzer instances can be installed on a single instance of WebSphere. Other applications can also coexist with Data Analyzer on a single WebSphere instance. Although this architecture should be considered during sizing estimates, it allows greater flexibility during installation.
- With Data Analyzer 8, there is a console version of the installation available. X-Windows is no longer required for WebSphere installations.
- For WebSphere on UNIX installations, Data Analyzer must be installed using the root user or system administrator account. Two groups (mqm and mqbrkrs) must be created prior to the installation, and the root account should be added to both of these groups.
- For WebSphere on Windows installations, ensure that Data Analyzer is installed under the padaemon local Windows user ID, which is in the Administrative group and has the advanced user rights "Act as part of the operating system" and "Log on as a service." During the installation, the padaemon account needs to be added to the mqm group.
- If the Data Analyzer installation files are transferred to the Data Analyzer Server, they must be FTP'd in binary format.
- To enable an installation error log, read the Knowledgebase article "HOW TO: Debug PowerAnalyzer Installations," available through My Informatica (http://my.informatica.com).
- During the WebSphere installation process, you are prompted to enter a directory for the application server and the HTTP (web) server. In both instances, it is advisable to keep the default installation directory. Directory names for the application server and HTTP server that include spaces may result in errors.
- During the Data Analyzer installation process, you are prompted to choose an authentication method for Data Analyzer, such as repository, NT Domain, or LDAP. If LDAP or NT Domain authentication is utilized, have the configuration parameters available during installation, as the installer configures all properties files at installation.
- The Data Analyzer license file must be applied prior to starting Data Analyzer.

Configuration Screen


Installation and Configuration Tips: UNIX Servers


With Data Analyzer 8, there is a console version of the installation available. For previous versions of Data Analyzer, a graphics display server is required for a Data Analyzer installation on UNIX. On UNIX, the graphics display server is typically an X-Windows server, although an X-Windows Virtual Frame Buffer (XVFB) or personal computer X-Windows software such as WRQ Reflection X can also be used. In any case, the X-Windows server does not need to exist on the local machine where Data Analyzer is being installed, but it does need to be accessible. A remote X-Windows, XVFB, or PC-X server can be used by setting the DISPLAY variable to the appropriate IP address, as discussed below. If the X-Windows server is not installed on the machine where Data Analyzer will be installed, Data Analyzer can be installed using an X-Windows server installed on another machine. Simply redirect the DISPLAY variable to use the X-Windows server on the other UNIX machine. To redirect the host output, define the environment variable DISPLAY. On the command line, type the following command and press Enter:

C shell:
setenv DISPLAY <TCP/IP node of X-Windows server>:0

Bourne/Korn shell:
export DISPLAY=<TCP/IP node of X-Windows server>:0

Configuration

Data Analyzer requires a means to render graphics for charting and indicators. When graphics rendering is not configured properly, charts and indicators do not display properly on dashboards or reports. For Data Analyzer
installations using an application server with JDK 1.4 and greater, the java.awt.headless=true setting can be set in the application server startup scripts to facilitate graphics rendering for Data Analyzer. If the application server does not use JDK 1.4 or later, use an X-Windows server or XVFB to render graphics. The DISPLAY environment variable should be set to the IP address of the X-Windows or XVFB server prior to starting Data Analyzer.
The application server heap size is the memory allocation for the JVM. The recommended heap size depends on the memory available on the machine hosting the application server and on server load, but the recommended starting point is 512MB. This is the first setting that should be examined when tuning a Data Analyzer instance.
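As a sketch, both the headless-graphics setting and the heap size can be applied in the application server startup script. The snippet below assumes a JBoss-style JAVA_OPTS variable; WebLogic and WebSphere expose equivalent JVM settings through their own startup scripts or administrative consoles, so the variable name is an assumption:

```shell
# Illustrative JVM settings for the application server startup script.
# JAVA_OPTS is typical for JBoss; adjust the variable name for your server.
JAVA_OPTS="$JAVA_OPTS -Xms512m -Xmx512m -Djava.awt.headless=true"
export JAVA_OPTS
echo "$JAVA_OPTS"
```

Setting -Xms equal to -Xmx avoids heap-resizing pauses, and java.awt.headless=true enables chart rendering without an X-Windows display on JDK 1.4 and later.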

Last updated: 24-Jul-07 16:40


Data Connectivity using PowerCenter Connect for BW Integration Server

Challenge


Understanding how to use PowerCenter Connect for SAP NetWeaver - BW Option to load data into the SAP BW (Business Information Warehouse).

Description
The PowerCenter Connect for SAP NetWeaver - BW Option supports the SAP Business Information Warehouse as both a source and target.

Extracting Data from BW


PowerCenter Connect for SAP NetWeaver - BW Option lets you extract data from SAP BW to use as a source in a PowerCenter session. PowerCenter Connect for SAP NetWeaver - BW Option integrates with the Open Hub Service (OHS), SAP's framework for extracting data from BW. OHS uses data from multiple BW data sources, including SAP's InfoSources and InfoCubes. The OHS framework includes InfoSpoke programs, which extract data from BW and write the output to SAP transparent tables.

Loading Data into BW


PowerCenter Connect for SAP NetWeaver - BW Option lets you import BW target definitions into the Designer and use the target in a mapping to load data into BW. PowerCenter Connect for SAP NetWeaver - BW Option uses the Business Application Program Interface (BAPI) to exchange metadata and load data into BW. PowerCenter can use SAP's business content framework to provide a high-volume data warehousing solution, or SAP's Business Application Program Interface (BAPI), SAP's strategic technology for linking components into the Business Framework, to exchange metadata with BW. PowerCenter extracts and transforms data from multiple sources and uses SAP's high-speed bulk BAPIs to load the data into BW, where it is integrated with industry-specific models for analysis through the SAP Business Explorer tool.


Using PowerCenter with PowerCenter Connect to Populate BW


The following paragraphs summarize some of the key differences in using PowerCenter with PowerCenter Connect to populate SAP BW rather than working with standard RDBMS sources and targets:

- BW uses a pull model. The BW must request data from a source system before the source system can send data to the BW. PowerCenter must first register with the BW using SAP's Remote Function Call (RFC) protocol.
- The native interface to communicate with BW is the Staging BAPI, an API published and supported by SAP. Three products in the PowerCenter suite use this API: PowerCenter Designer uses the Staging BAPI to import metadata for the target transfer structures; the PowerCenter Integration Server for BW uses the Staging BAPI to register with BW and receive requests to run sessions; and the PowerCenter Server uses the Staging BAPI to perform metadata verification and load data into BW.
- Programs communicating with BW use the SAP standard saprfc.ini file. The saprfc.ini file is similar to the tnsnames file in Oracle or the interface file in Sybase.
- The PowerCenter Designer reads metadata from BW, and the PowerCenter Server writes data to BW.
- BW requires that all metadata extensions be defined in the BW Administrator Workbench. The definition must be imported into the Designer. An active structure is the target for PowerCenter mappings loading BW.
- Because of the pull model, BW must control all scheduling. BW invokes the PowerCenter session when the InfoPackage is scheduled to run in BW.
- BW only supports insertion of data. There is no concept of updates or deletes through the Staging BAPI.

Steps for Extracting Data from BW


The process of extracting data from SAP BW is quite similar to extracting data from SAP. Similar transports are used on the SAP side, and data type support is the same as that supported for SAP PowerCenter Connect. The steps required for extracting data are:

1. Create an InfoSpoke. Create an InfoSpoke in the BW to extract the data from the BW database and write it to either a database table or a file output target.
2. Import the ABAP program. Import the Informatica-provided ABAP program, which calls the workflow created in the Workflow Manager.
3. Create a mapping. Create a mapping in the Designer that uses the database table or file output target as a source.
4. Create a workflow to extract data from BW. Create a workflow and session task to automate data extraction from BW.
5. Create a Process Chain. A BW Process Chain links programs together to run in sequence. Create a Process Chain to link the InfoSpoke and ABAP programs together.
6. Schedule the data extraction from BW. Set up a schedule in BW to automate data extraction.

Steps To Load Data into BW


1. Install and configure PowerCenter components. The installation of the PowerCenter Connect for SAP NetWeaver - BW Option includes both a client and a server component. The Connect server must be installed in the same directory as the PowerCenter Server. Informatica recommends installing the Connect client tools in the same directory as the PowerCenter Client. For more details on installation and configuration, refer to the PowerCenter and PowerCenter Connect installation guides.

Note: For SAP transports for PowerConnect version 8.1 and above, it is crucial to install or upgrade the PowerCenter 8.1 transports on the appropriate SAP system when installing or upgrading PowerCenter Connect for SAP NetWeaver - BW Option. If you are extracting data from BW using OHS, you must also configure the mySAP option. If the BW system is separate from the SAP system, install the designated transports on the BW system. It is also important to note that there are now three categories of transports (compared to two in previous versions):
- Transports for SAP versions 3.1H and 3.1I.
- Transports for SAP versions 4.0B to 4.6B, 4.6C, and non-Unicode versions 4.7 and above.
- Transports for SAP Unicode versions 4.7 and above; this category was added for Unicode extraction support, which was not available in SAP versions 4.6 and earlier.


2. Build the BW components. To load data into BW, you must build components in both BW and PowerCenter. First build the BW components in the Administrator Workbench:

- Define PowerCenter as a source system to BW. BW requires an external source definition for all non-R/3 sources.
- Create the InfoObjects in BW (an InfoObject is similar to a database table).
- Create the InfoSource. The InfoSource represents a provider structure. Create the InfoSource in the BW Administrator Workbench and import the definition into the PowerCenter Warehouse Designer.
- Assign the InfoSource to the PowerCenter source system.
- Activate the InfoSource. When you activate the InfoSource, you activate the InfoObjects and the transfer rules.

3. Configure the saprfc.ini file. This file is required for PowerCenter and Connect to connect to BW. PowerCenter uses two types of entries to connect to BW through the saprfc.ini file:

- Type A. Used by the PowerCenter Client and PowerCenter Server. Specifies the BW application server.
- Type R. Used by the PowerCenter Connect for SAP NetWeaver - BW Option. Specifies the external program, which is registered at the SAP gateway.

Note: Do not use Notepad to edit the saprfc.ini file because Notepad can corrupt the file. Set the RFC_INI environment variable on all Windows NT, Windows 2000, and Windows 95/98 machines with a saprfc.ini file. RFC_INI is used to locate the saprfc.ini file.
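For illustration only, a minimal saprfc.ini might contain one entry of each type. The DEST names, host name, program ID, and gateway values below are placeholders, not values required by PowerCenter:

```ini
/* Type A: used by the PowerCenter Client and Server (BW application server) */
DEST=BW_A
TYPE=A
ASHOST=bwhost01
SYSNR=00

/* Type R: used by the Connect for BW Server (registered at the SAP gateway) */
DEST=BW_R
TYPE=R
PROGID=INFA_BW_LOAD
GWHOST=bwhost01
GWSERV=sapgw00
```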

4. Start the Connect for BW Server. Start the Connect for BW Server after you start the PowerCenter Server and before you create the InfoPackage in BW.


5. Build mappings. Import the InfoSource into the PowerCenter repository and build a mapping using the InfoSource as a target. The following restrictions apply to building mappings with a BW InfoSource target:

- You cannot use BW as a lookup table.
- You can use only one transfer structure for each mapping.
- You cannot execute stored procedures in a BW target.
- You cannot partition pipelines with a BW target.
- You cannot copy fields that are prefaced with /BIC/ from the InfoSource definition into other transformations.
- You cannot build an update strategy in a mapping. BW supports only inserts; it does not support updates or deletes. You can use an Update Strategy transformation in a mapping, but the Connect for BW Server attempts to insert all records, even those marked for update or delete.

6. Load data. To load data into BW from PowerCenter, both PowerCenter and the BW system must be configured. Use the following steps to load data into BW:

- Configure a workflow to load data into BW. Create a session in a workflow that uses a mapping with an InfoSource target definition.
- Create and schedule an InfoPackage. The InfoPackage associates the PowerCenter session with the InfoSource.

When the Connect for BW Server starts, it communicates with the BW to register itself as a server. The Connect for BW Server waits for a request from the BW to start the workflow. When the InfoPackage starts, the BW communicates with the registered Connect for BW Server and sends the workflow name to be scheduled with the PowerCenter Server. The Connect for BW Server reads information about the workflow and sends a request to the PowerCenter Server to run the workflow. The PowerCenter Server validates the workflow name in the repository and the


workflow name in the InfoPackage. The PowerCenter Server executes the session and loads the data into BW. You must start the Connect for BW Server after you restart the PowerCenter Server.

Supported Datatypes
The PowerCenter Server transforms data based on the Informatica transformation datatypes. BW can only receive data in 250-byte packets. The PowerCenter Server converts all data to a CHAR datatype and puts it into packets of 250 bytes, plus one byte for a continuation flag. BW receives data until it reads a continuation flag set to zero. Within the transfer structure, BW then converts the data to the BW datatype. Currently, BW only supports the following datatypes in transfer structures assigned to BAPI source systems (PowerCenter): CHAR, CUKY, CURR, DATS, NUMC, TIMS, UNIT. All other datatypes result in the following error in BW: Invalid data type (data type name) for source system of type BAPI.
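The packet framing described above can be sketched as follows. This is an illustrative model of the 250-byte-payload-plus-flag format, not Informatica's actual implementation:

```python
def packetize(data: bytes, payload_size: int = 250) -> list:
    """Split CHAR data into fixed-size payloads, each followed by a one-byte
    continuation flag: 1 = more packets follow, 0 = final packet."""
    packets = []
    for start in range(0, len(data), payload_size):
        # Pad the last payload with spaces so every packet is the same size.
        chunk = data[start:start + payload_size].ljust(payload_size, b" ")
        flag = b"\x01" if start + payload_size < len(data) else b"\x00"
        packets.append(chunk + flag)  # 250 payload bytes + 1 flag byte
    return packets
```

Under this model, a 600-byte row travels as three 251-byte packets, with only the last packet carrying a zero flag.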

Date/Time Datatypes
The transformation date/time datatype supports dates with precision to the second. If you import a date/time value that includes milliseconds, the PowerCenter Server truncates to seconds. If you write a date/time value to a target column that supports milliseconds, the PowerCenter Server inserts zeros for the millisecond portion of the date.

Binary Datatypes
BW does not allow you to build a transfer structure with binary datatypes. Therefore, you cannot load binary data from PowerCenter into BW.

Numeric Datatypes
PowerCenter does not support the INT1 datatype.

Performance Enhancement for Loading into SAP BW


If you see a performance slowdown for sessions that load into SAP BW, set the default buffer block size to 15 to 20MB to enhance performance. You can put 5,000 to 10,000 rows per block, so you can calculate the buffer block size needed with the following formula:

Row size x Rows per block = Default buffer block size

For example, if your target row size is 2KB: 2KB x 10,000 rows = 20MB.
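The arithmetic behind the example can be checked directly; the sizes here are the illustrative values from the text, not recommendations beyond it:

```python
row_size_kb = 2           # target row size in KB (example value from the text)
rows_per_block = 10_000   # 5,000-10,000 rows per block is the suggested range
buffer_block_kb = row_size_kb * rows_per_block
print(buffer_block_kb)    # 20000 KB, i.e. roughly the 20MB cited above
```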

Last updated: 04-Jun-08 16:31


Data Connectivity using PowerExchange for WebSphere MQ

Challenge


Integrate WebSphere MQ applications with PowerCenter mappings.

Description
With increasing requirements for on-demand, real-time data integration and the development of Enterprise Application Integration (EAI) architectures, WebSphere MQ has become an important part of the Informatica data integration platform. PowerExchange for WebSphere MQ provides data integration for transactional data generated by continuously running messaging systems. PowerCenter's Zero Latency (ZL) Engine provides immediate processing of trickle-feed data for these types of messaging systems, allowing both uni-directional and bi-directional processing of real-time data flows.

High Volume System Considerations


When working with high-volume systems, two things to consider are the volume and size of the messages coming over the network, and whether the messages are persistent or non-persistent. Although a queue may be configured for persistence, a specific message can override this setting. When a message is persistent, the Queue Manager first writes the message out to a log before it allows it to be visible in the queue. In a very high-volume flow, if this is not handled correctly, it can lead to performance degradation, and the logging can potentially fill up the file system. Non-persistent messages are immediately visible in the queue for processing, but unlike persistent messages, they cannot be recovered if the Queue Manager or server crashes. To handle this type of flow volume, PowerCenter workflows can be configured to run in a Grid environment. The image below shows the two options that are available for persistence when creating a Local Queue:


In conjunction with the PowerCenter Grid option, WebSphere MQ can also be clustered to allow multiple Queue Managers to process the same message flow(s). In this type of configuration, separate Integration Services can be created, each holding a unique MQSERVER environment variable. Alternately, a Client Connection can be created for one Integration Service, with multiple connection properties configured for each Queue Manager in the cluster that holds the flow.
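As an illustration, an MQSERVER variable names a server-connection channel, a transport type, and a connection name. The channel name, host, and port below are placeholders:

```shell
# Hypothetical client connection for one Integration Service in the cluster;
# a second service would point at mqhost2, and so on.
MQSERVER="CLUSTER.SVRCONN/TCP/mqhost1(1414)"
export MQSERVER
echo "$MQSERVER"
```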

Message Affinity
Message Affinity is a consideration that is unique to clustered environments. It occurs when messages are processed out of the order in which they should be processed.

Example: In a trading-system environment, a user's sell message arrives before the buy message.

Solution: To help limit this behavior, messages can carry a unique ID in the message header to indicate grouping as well as order.

IMPORTANT: It is not common practice to place the resequencing of these messages on the middleware software. The sending and receiving applications should be responsible for this algorithm.
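A minimal sketch of such application-side resequencing, assuming each message carries a hypothetical group ID and sequence number as described above:

```python
def resequence(messages: list) -> list:
    """Order messages by the (group_id, seq) values stamped by the sender.

    The field names are illustrative; a real application would read them
    from the WebSphere MQ message header or payload."""
    return sorted(messages, key=lambda m: (m["group_id"], m["seq"]))

trades = [
    {"group_id": "T1", "seq": 2, "body": "sell"},
    {"group_id": "T1", "seq": 1, "body": "buy"},
]
print([m["body"] for m in resequence(trades)])  # ['buy', 'sell']
```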

Message Sizes
The message size for any given flow needs to be determined before workflows and queues are developed and architected. By default, all messaging communication objects are set to allow up to a 4MB message size. If a message in the flow is larger than 4MB, the Queue Manager logs an error and does not allow the message through.


To overcome this issue, the MQCHLLIB/MQCHLTAB environment variables must be used. The following settings must also be modified to allow for the larger message(s) in the queue. 1. Client Connection Channel: Set the Maximum Message Length to the largest estimated message size (100MB limit).

2. Local Queue: Set the Max Message Length to the largest message size (100 MB limit).


3. Queue Manager: The Queue Manager's Max Message Length setting is key, because it caps what the other objects can allow through. If the Queue Manager has a Max Message Length set to anything smaller than what is set on a Channel or a Local Queue, the message will fail. For large messaging systems, create a separate Queue Manager just for those flows. The maximum message size a Queue Manager can handle is 100MB.
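The three settings above can be raised with MQSC commands run via runmqsc against the relevant queue manager. The object names below are hypothetical, and 104857600 bytes corresponds to the 100MB limit:

```
* Illustrative MQSC commands (hypothetical object names)
ALTER QMGR MAXMSGL(104857600)
ALTER QLOCAL(LARGE.MSG.QUEUE) MAXMSGL(104857600)
ALTER CHANNEL(LARGE.CLNTCONN) CHLTYPE(CLNTCONN) MAXMSGL(104857600)
```

The MQCHLLIB variable then points the client at the directory containing the client channel definition table, and MQCHLTAB at its file name (AMQCLCHL.TAB by default), in place of a single MQSERVER channel definition.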


Example: A high-volume application requires PowerCenter to process a minimum of 200 messages/second, 24x7. One message has four segments, and each segment loads to a separate table. Three of the segments are optional and may not be present in a given message. The message is XML and must go through a midstream XML parser in order to get the separate data out for each table. If the midstream XML parser cannot segment the XML and load it to the correct database tables fast enough to keep up with the message flow, messages can back up and cause the Queue Manager to overflow.

Solution: First estimate each message's maximum size and then create a separate queue for each of the segments within the message. Create individual workflows to handle each queue and to load the data to the correct table. Then use an expression in PowerCenter to break out each segment and load it to the associated queue. For the optional segments, if they don't exist, there is nothing to load. Each workflow can then separately load the segmented XML into its own midstream XML parser and into the correct database.

Result: Processing speed through PowerCenter increased to 400-450 messages/second.

Last updated: 27-May-08 13:07


Data Connectivity using PowerExchange for SAP NetWeaver

Challenge


Understanding how to install PowerExchange for SAP NetWeaver, extract data from SAP R/3, and load data into SAP R/3.

Description
SAP R/3 is ERP software that provides multiple business applications/modules, such as financial accounting, materials management, sales and distribution, human resources, CRM, and SRM. The core R/3 system (BASIS layer) is programmed in Advanced Business Application Programming, Fourth Generation (ABAP/4, or ABAP), a language proprietary to SAP. PowerExchange for SAP NetWeaver can write, read, and change data in R/3 via the BAPI/RFC and IDoc interfaces. The ABAP interface of PowerExchange for SAP NetWeaver can only read data from SAP R/3. PowerExchange for SAP NetWeaver provides the ability to extract SAP R/3 data into data warehouses, data integration applications, and other third-party applications. All of this is accomplished without writing complex ABAP code. PowerExchange for SAP NetWeaver generates ABAP programs and is capable of extracting data from transparent tables, pool tables, and cluster tables. When integrated with R/3 using ALE (Application Link Enabling), PowerExchange for SAP NetWeaver can also extract data from R/3 using outbound IDocs (Intermediate Documents) in near real-time. The ALE concept, available since R/3 Release 3.0, supports the construction and operation of distributed applications. It incorporates controlled exchange of business data messages while ensuring data consistency across loosely-coupled SAP applications. The integration of various applications is achieved by using synchronous and asynchronous communication, rather than by means of a central database. The database server stores the physical tables in the R/3 system, while the application server stores the logical tables. A transparent table definition on the application server is represented by a single physical table on the database server. Pool and cluster tables are logical definitions on the application server that do not have a one-to-one relationship with a physical table on the database server.

Communication Interfaces
TCP/IP is the native communication interface between PowerCenter and SAP R/3. Other interfaces between the two include:

Common Program Interface-Communications (CPI-C). The CPI-C communication protocol enables online data exchange and data conversion between R/3 and PowerCenter. To initialize CPI-C communication with PowerCenter, SAP R/3 requires information such as the host name of the application server and the SAP gateway. This information is stored on the PowerCenter Server in a configuration file named sideinfo. The PowerCenter Server uses parameters in the sideinfo file to execute ABAP stream mode sessions.

Remote Function Call (RFC). RFC is the remote communication protocol used by SAP and is based on RPC (Remote Procedure Call). To execute remote calls from PowerCenter, SAP R/3 requires information such as the connection type, and the service name and gateway on the application server. This information is stored on the PowerCenter Client and PowerCenter Server in a configuration file named saprfc.ini. PowerCenter makes remote function calls when importing source definitions, installing ABAP programs, and running ABAP file mode sessions.

Transport system. The transport system in SAP is a mechanism to transfer objects developed on one system to another system. The transport system is primarily used to migrate code and configuration from development to QA and production systems. It can be used in the following cases:
- PowerExchange for SAP NetWeaver installation transports
- PowerExchange for SAP NetWeaver generated ABAP programs

Note: If the ABAP programs are installed in the $TMP development class, they cannot be transported from development to production. Ensure you have a transportable development class/package for the ABAP mappings. Security You must have proper authorizations on the R/3 system to perform integration tasks. The R/3 administrator needs to create authorizations, profiles, and users for PowerCenter users.

Integration Feature Import Definitions, Install Programs Extract Data

Authorization Object Activity S_DEVELOP All activities. Also need to set Development Object ID to PROG READ

S_TABU_DIS

Run File Mode Sessions

S_DATASET

WRITE

Submit Background Job Release Background Job

S_PROGRAM S_BTCH_JOB

BTCSUBMIT, SUBMIT DELE, LIST, PLAN, SHOW Also need to set Job Operation to RELE

Run Stream Mode Sessions Authorize RFC privileges

S_CPIC S_RFC

All activities All activities

You also need access to the SAP GUI, as described in the following SAP GUI Parameters table:

Parameter       Variable              Comments
User ID         $SAP_USERID           Username that connects to the SAP GUI and is authorized for read-only access to transactions SE12, SE15, SE16, and SPRO
Password        $SAP_PASSWORD         Password for the above user
System Number   $SAP_SYSTEM_NUMBER    SAP system number
Client Number   $SAP_CLIENT_NUMBER    SAP client number
Server          $SAP_SERVER           Server on which this instance of SAP is running
Key Capabilities of PowerExchange for SAP NetWeaver

Some key capabilities of PowerExchange for SAP NetWeaver include:

- Extract data from SAP R/3 using the ABAP, BAPI/RFC, and IDoc interfaces.
- Migrate/load data from any source into R/3 using the IDoc, BAPI/RFC, and DMI interfaces. Generate DMI files ready to be loaded into SAP via SXDA TOOLS, LSMW, or SAP standard delivered programs.
- Support calling BAPI and RFC functions dynamically from PowerCenter for data integration. PowerExchange for SAP NetWeaver can make BAPI and RFC function calls dynamically from mappings to extract or load data.
- Capture changes to the master and transactional data in SAP R/3 using ALE. PowerExchange for SAP NetWeaver can receive outbound IDocs from SAP R/3 in real time and load into SAP R/3 using inbound IDocs. To receive IDocs in real time using ALE, install PowerExchange for SAP NetWeaver on PowerCenterRT.
- Provide rapid development of the data warehouse based on R/3 data using Analytic Business Components for SAP R/3 (ABC). ABC is a set of business content that includes mappings, mapplets, source objects, targets, and transformations.
- Set partition points in a pipeline for outbound/inbound IDoc sessions; sessions that fail when reading outbound IDocs from an SAP R/3 source can be configured for recovery. You can also receive data from outbound IDoc files and write data to inbound IDoc files.
- Insert ABAP code blocks to add functionality to the ABAP program flow, and use static/dynamic filters to reduce returned rows. Customize the ABAP program flow with joins, filters, SAP functions, and code blocks. For example: qualifying table = table1-field1 = table2-field2, where the qualifying table is the last table in the condition based on the join order, including outer joins.
- Create ABAP program variables to represent SAP R/3 structures, structure fields, or values in the ABAP program.
- Remove ABAP program information from SAP R/3 and the repository when a folder is deleted.
- Provide enhanced platform support by running on 64-bit AIX and HP-UX (Itanium). You can install PowerExchange for SAP NetWeaver for the PowerCenter Server and Repository Server on SuSE Linux or on Red Hat Linux.

Installation and Configuration Steps

The PowerExchange for SAP NetWeaver setup programs install components for the PowerCenter Server, Client, and Repository Server. These programs install drivers, connection files, and a repository plug-in XML file that enables integration between PowerCenter and SAP R/3. The setup programs can also install PowerExchange for SAP NetWeaver Analytic Business Components and PowerExchange for SAP NetWeaver Metadata Exchange. The PowerExchange for SAP NetWeaver repository plug-in is called sapplg.xml. After the plug-in is installed, it needs to be registered in the PowerCenter repository.

For SAP R/3

Informatica provides a group of customized objects required for R/3 integration in the form of transport files. These objects include tables, programs, structures, and functions that PowerExchange for SAP NetWeaver exports to data files. The R/3 system administrator must use the transport control program, tp import, to transport these object files on the R/3 system. The transport process creates a development class called ZERP. The SAPTRANS directory contains data and co files. The data files are the actual transport objects; the co files are control files containing information about the transport request. The R/3 system needs development objects and user profiles established to communicate with PowerCenter. Preparing R/3 for integration involves the following tasks:

- Transport the development objects on the PowerCenter CD to R/3. PowerCenter calls these objects each time it makes a request to the R/3 system.
- Run the transport program that generates unique IDs.
- Establish profiles in the R/3 system for PowerCenter users.
- Create a development class for the ABAP programs that PowerCenter installs on the SAP R/3 system.

For PowerCenter

The PowerCenter Server and Client need drivers and connection files to communicate with SAP R/3. Preparing PowerCenter for integration involves the following tasks:

- Run the installation programs on the PowerCenter Server and Client machines.
- Configure the connection files:
  - The sideinfo file on the PowerCenter Server allows PowerCenter to initiate CPI-C with the R/3 system. The required parameters for sideinfo are:
    DEST - logical name of the R/3 system
    TYPE - set to A to indicate a connection to a specific R/3 system
    ASHOST - host name of the SAP R/3 application server
    SYSNR - system number of the SAP R/3 application server
  - The saprfc.ini file on the PowerCenter Client and Server allows PowerCenter to connect to the R/3 system as an RFC client. The required parameters for saprfc.ini are:
    DEST - logical name of the R/3 system
    LU - host name of the SAP application server machine
    TP - set to sapdp<system number>
    GWHOST - host name of the SAP gateway machine
    GWSERV - set to sapgw<system number>
    PROTOCOL - set to I for a TCP/IP connection

Following is a summary of the required steps:

1. Install PowerExchange for SAP NetWeaver on PowerCenter.
2. Configure the sideinfo file.
3. Configure the saprfc.ini file.
4. Set the RFC_INI environment variable.
5. Configure an application connection for SAP R/3 sources in the Workflow Manager.
6. Configure an SAP/ALE IDoc connection in the Workflow Manager to receive IDocs generated by the SAP R/3 system.
7. Configure the FTP connection to access staging files through FTP.
8. Install the repository plug-in in the PowerCenter repository.
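Following the parameter lists above, the two connection files might contain entries like the sketch below. The DEST name, host names, and system number are placeholders, and exact file syntax can vary by PowerCenter and SAP version, so verify against your installation's documentation.

```ini
; sideinfo entry on the PowerCenter Server (CPI-C, stream mode) -- illustrative values
DEST=SAPR3DEV
TYPE=A
ASHOST=sapapp01.example.com
SYSNR=00

; saprfc.ini entry on the PowerCenter Client and Server (RFC) -- illustrative values
DEST=SAPR3DEV
LU=sapapp01.example.com
TP=sapdp00
GWHOST=sapgw01.example.com
GWSERV=sapgw00
PROTOCOL=I
```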

Configuring the Services File

Windows

If SAPGUI is not installed, you must make entries in the Services file to run stream mode sessions. This file is found in the \WINNT\SYSTEM32\drivers\etc directory. Entries should be similar to the following:

sapdp<system number> <port number of dispatcher service>/tcp
sapgw<system number> <port number of gateway service>/tcp

Note: SAPGUI is not technically required, but experience has shown that evaluators typically want to log into the R/3 system to use the ABAP workbench and to view table contents.

UNIX

The Services file is located in /etc. Add entries of the same form:

sapdp<system number> <port number of dispatcher service>/tcp
sapgw<system number> <port number of gateway service>/tcp

The system number and port numbers are provided by the BASIS administrator.
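As a concrete sketch, for SAP system number 00 the entries would typically look like the following. The ports follow the common SAP convention of 32nn for the dispatcher and 33nn for the gateway, but confirm the actual values with the BASIS administrator.

```
sapdp00  3200/tcp
sapgw00  3300/tcp
```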

Configure Connections to Run Sessions

Informatica supports two methods of communication between the SAP R/3 system and the PowerCenter Server:

- Stream mode does not create any intermediate files on the R/3 system. This method is faster, but uses more CPU cycles on the R/3 system.
- File mode creates an intermediate file on the SAP R/3 system, which is then transferred to the machine running the PowerCenter Server.

If you want to run file mode sessions, you must provide either FTP access or NFS access from the machine running the PowerCenter Server to the machine running SAP R/3. This, of course, assumes that PowerCenter and SAP R/3 are not running on the same machine; it is possible to run PowerCenter and R/3 on the same system, but highly unlikely. If you want to use file mode sessions and your R/3 system is on a UNIX system, you need to do one of the following:

- Provide the login and password for the UNIX account used to run the SAP R/3 system.
- Provide a login and password for a UNIX account belonging to the same group as the UNIX account used to run the SAP R/3 system.
- Create a directory on the machine running SAP R/3, and run chmod g+s on that directory. Provide the login and password for the account used to create this directory.

Configure database connections in the Server Manager to access the SAP R/3 system when running a session, then configure an FTP connection to access the staging file through FTP.
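The chmod g+s step above can be sketched as follows. This is an illustrative stand-in for the manual command, not part of the product, and the directory name is hypothetical.

```python
# Create a staging directory and set the set-group-ID bit on it, the
# equivalent of running: mkdir stage_r3 && chmod g+s stage_r3
# Files created in the directory then inherit the directory's group, which
# lets the SAP account and the shared-group account exchange staging files.
import os
import stat

os.makedirs("stage_r3", exist_ok=True)
mode = stat.S_IRWXU | stat.S_IRWXG | stat.S_ISGID  # 0o2770
os.chmod("stage_r3", mode)
```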

Extraction Process

R/3 source definitions can be imported from the logical tables using the RFC protocol. Extracting data from R/3 is a four-step process (the fourth step, creating a session and running the workflow, is described in the next section):

1. Import source definitions. The PowerCenter Designer connects to the R/3 application server using RFC and calls a function in the R/3 system to import source definitions. Note: If you plan to join two or more tables in SAP, be sure you have optimized join conditions and have identified your driving table (e.g., if you plan to extract data from the bkpf and bseg accounting tables, drive your extracts from the bkpf table). There is a significant difference in performance when the joins are properly defined.

2. Create a mapping. When creating a mapping using an R/3 source definition, you must use an ERP source qualifier. In the ERP source qualifier, you can customize properties of the ABAP program that the R/3 server uses to extract source data. You can also use joins, filters, ABAP program variables, ABAP code blocks, and SAP functions to customize the ABAP program.

3. Generate and install the ABAP program. You can install two types of ABAP programs for each mapping:
   - File mode. Extracts data to a file. The PowerCenter Server accesses the file through FTP or NFS mount. This mode is used for large extracts, since SAP sets timeouts for long-running queries.
   - Stream mode. Extracts data to buffers. The PowerCenter Server accesses the buffers through CPI-C, the SAP protocol for program-to-program communication. This mode is preferred for short-running extracts.

You can modify the ABAP program block and customize it according to your requirements (e.g., if you want to get data incrementally, create a mapping variable/parameter and use it in the ABAP program).

Create Session and Run Workflow

- Stream mode. In stream mode, the installed ABAP program creates buffers on the application server. The program extracts source data and loads it into the buffers. When a buffer fills, the program streams the data to the PowerCenter Server using CPI-C. With this method, the PowerCenter Server can process data as soon as it is received.
- File mode. When running a session in file mode, the session must be configured to access the file through NFS mount or FTP. When the session runs, the installed ABAP program creates a file on the application server. The program extracts source data and loads it into the file. When the file is complete, the PowerCenter Server accesses the file through FTP or NFS mount and continues processing the session.

Data Integration Using RFC/BAPI Functions

PowerExchange for SAP NetWeaver can generate RFC/BAPI function mappings in the Designer to extract data from SAP R/3, change data in R/3, or load data into R/3. When it uses an RFC/BAPI function mapping in a workflow, the PowerCenter Server makes the RFC function calls on R/3 directly to process the R/3 data. It doesn't have to generate and install the ABAP program for data extraction.

Data Integration Using ALE

PowerExchange for SAP NetWeaver can integrate PowerCenter with SAP R/3 using ALE. With PowerExchange for SAP NetWeaver, PowerCenter can generate mappings in the Designer to receive outbound IDocs from SAP R/3 in real time. It can also generate mappings to send inbound IDocs to SAP for data integration. When PowerCenter uses an inbound or outbound mapping in a workflow to process data in SAP R/3 using ALE, it doesn't have to generate and install the ABAP program for data extraction.

Analytic Business Components

Analytic Business Components for SAP R/3 (ABC) allows you to use predefined business logic to extract and transform R/3 data. It works in conjunction with PowerCenter and PowerExchange for SAP NetWeaver to extract master data, perform lookups, and provide documents and other fact and dimension data from the following R/3 modules:

- Financial Accounting
- Controlling
- Materials Management
- Personnel Administration and Payroll Accounting
- Personnel Planning and Development
- Sales and Distribution

Refer to the ABC Guide for complete installation and configuration information.

Last updated: 04-Jun-08 17:30


Data Connectivity using PowerExchange for Web Services

Challenge

Understanding PowerExchange for Web Services and configuring PowerCenter to access a secure web service.

Description

PowerExchange for Web Services is a service-oriented integration technology that can be used to bring application logic embedded in existing systems into the PowerCenter data integration platform. Leveraging the logic in existing systems is a cost-effective method for data integration. For example, insurance policy score calculation logic that is available in a mainframe application can be exposed as a web service and then used by PowerCenter mappings. PowerExchange for Web Services (Web Services Consumer) allows PowerCenter to act as a web services client to consume external web services. PowerExchange for Web Services uses the Simple Object Access Protocol (SOAP) to communicate with the external web service provider. An external web service can be invoked from PowerCenter in three ways:

- Web Service source
- Web Service transformation
- Web Service target

To increase the performance of message transmission, SOAP requests and responses can be compressed. Furthermore, pass-through partitioned sessions can be used to increase parallelism for large data volumes.

Web Service Source Usage

PowerCenter supports a request-response type of operation when using a Web Services source. The web service can be used as a source if the input in the SOAP request remains fairly constant (since input values for a web service source can only be provided at the source transformation level). Although web services source definitions can be created without using a WSDL, they can be edited in the WSDL workspace in PowerCenter Designer.

Web Service Transformation Usage

PowerCenter also supports a request-response type of operation when using a Web Services transformation. The web service can be used as a transformation if input data is available midstream and the response values will be captured from the web service. The following steps provide an example of invoking a stock quote web service to learn the price of each of the ticker symbols available in a flat file:

1. In the Transformation Developer, create a Web Services Consumer transformation.
2. Specify the URL for the stock quote WSDL and choose the operation get quote.
3. Connect the input port of this transformation to the field containing the ticker symbols.
4. To invoke the web service for each input row, change to source-based commit with an interval of 1. Also change the Transaction Scope to Transaction in the Web Services Consumer transformation.
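The per-row call in step 4 sends one SOAP request per ticker symbol. The sketch below shows, in plain Python, the general shape of such a request; the operation name, element names, and namespace are hypothetical placeholders, and the actual envelope is generated by PowerCenter from the WSDL.

```python
# Build the kind of SOAP 1.1 request body that the Web Services Consumer
# transformation would send for each input row (one request per ticker symbol).
def build_get_quote_request(ticker: str) -> str:
    # Namespace and element names are illustrative, not from a real WSDL.
    return (
        '<soapenv:Envelope '
        'xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" '
        'xmlns:q="http://example.com/stockquote">'
        "<soapenv:Body>"
        f"<q:GetQuote><q:symbol>{ticker}</q:symbol></q:GetQuote>"
        "</soapenv:Body>"
        "</soapenv:Envelope>"
    )

# Source-based commit with an interval of 1 means one request per source row.
requests = [build_get_quote_request(sym) for sym in ("INFA", "SAP")]
```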

Web Service Target Usage

PowerCenter supports a one-way type of operation when using a Web Services target. The web service can be used as a target if it is needed only to send a message (and no response is needed). PowerCenter only waits for the web server to start processing the message; it does not wait for the web server to finish processing the web service operation. Existing relational and flat file definitions can be used for the target definitions, or target columns can be defined manually.

PowerExchange for Web Services and Web Services Provider

PowerCenter Web Services Provider is a separate product from PowerExchange for Web Services. An advantage of using PowerCenter Web Services Provider is that it decouples the web service that needs to be consumed from the client. By using PowerCenter as the glue, changes can be made that are transparent to the client. This is useful because often there is no access to the client code or to the web service. Other considerations include:

- PowerCenter Web Services Provider acts as a service provider and exposes many key functionalities as web services.
- In PowerExchange for Web Services, PowerCenter acts as a web service client and consumes external web services. It is not necessary to install or configure Web Services Provider in order to use PowerExchange for Web Services.
- Web services exposed through PowerCenter have two formats that can be invoked by different kinds of client programs (e.g., C#, Java, .NET) by using the WSDL that can be generated from the Web Services Hub:
  - Real-time: In real-time mode, web-enabled workflows are exposed. The Web Services Provider must be used and be pointed to the workflow that is going to be invoked as a web service. Workflows can be started and protected.
  - Batch: In batch mode, a pre-set group of services is exposed to run and monitor workflows in PowerCenter. This feature can be used for reporting and monitoring purposes.

Last but not least, PowerCenter's open architecture facilitates HTTP and HTTPS requests with an HTTP transformation for GET, POST, and SIMPLE POST methods to read data from or write data to an HTTP server.

Configuring PowerCenter to Invoke a Secure Web Service

Secure Sockets Layer (SSL) is used to provide security features such as authentication and encryption to web services applications. The authentication certificates follow the Public Key Infrastructure (PKI) standard, a system of digital certificates provided by certificate authorities to verify and authenticate parties in Internet communications or transactions. These certificates are managed in the following two keystore files:

- Trust store. A trust store holds the public keys for the entities it can trust. The Integration Service uses the entries in the trust store file to authenticate the external web services servers.
- Client store. A client store holds both the entity's public and private keys. The Integration Service sends the entries in the client store file to the web services provider so that the web services provider can authenticate the Integration Service.

By default, the trust certificates file is named ca-bundle.crt and contains certificates issued by major, trusted certificate authorities. The ca-bundle.crt file is located in <PowerCenter Installation Directory>/server/bin.


SSL authentication can be performed in three ways:

- Server authentication
- Client authentication
- Mutual authentication

All of the SSL authentication configurations can be set by entering values for Web Service application connections in the Workflow Manager.

Server Authentication

Since the web service provider is the server and the Integration Service is the client, the web service provider is responsible for authenticating the Integration Service. The Integration Service sends the web service provider a client certificate file containing a public key, and the web service provider verifies this file. The client certificate file and the corresponding private key file should be configured for this option.

Client Authentication

Since the Integration Service is the client of the web service provider, it establishes an SSL session to authenticate the web service provider. The Integration Service verifies that the authentication certificate sent by the web service provider exists in the trust certificates file. The trust certificates file should be configured for this option.

Mutual Authentication

The Integration Service and the web service provider exchange certificates and verify each other. For this option, the trust certificates file, the client certificate, and the corresponding private key file should be configured.

Converting Other Formats of Certificate Files

There are a number of other formats of certificate files available: DER format (.cer and .der extensions), PEM format (.pem extension), and PKCS#12 format (.pfx or .p12 extension). The private key for a client certificate must be in PEM format. Files can be converted from one certificate format to another using the OpenSSL utility. Refer to the OpenSSL documentation for complete information on such conversions. A few examples are given below.

To convert from DER to PEM (assuming there is a DER file called server.der):

openssl x509 -in server.der -inform DER -out server.pem -outform PEM

To convert a PKCS12 file called server.pfx to PEM:

openssl pkcs12 -in server.pfx -out server.pem

Web Service Performance Tips

The basis of web services communication is the exchange of XML documents, and performance is affected by the type of requests being transmitted. Below are some tips that can help to improve performance:

- Avoid frequent transmissions of huge data elements.
- The nesting of elements in a SOAP request has a significant effect on performance. Run these requests in verbose data mode in order to check for this.
- When data is being retrieved for aggregation purposes or for financial calculations (i.e., not real time), shift those requests to non-peak hours to improve response time.
- Capture the response time for each request sent by using Sysdate in an expression before the web service transformation, and in an expression after it. This shows the true latency, which can then be averaged to determine scaling needs.
- Try to limit the number of web service calls when possible. If you are using the same calls multiple times to return pieces of information for different targets, it is better to return a complete set of results with a unique ID and then stage the sourcing for the different targets.
- Sending simple datatypes (e.g., integer, float, string) improves performance.
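The latency-capture tip above can be illustrated with a small sketch: given before/after timestamps recorded around each web service call, average the differences to estimate response time. The timestamp values here are invented sample data.

```python
# Average per-row web service latency from timestamps captured in an
# expression before the call and in an expression after it.
from datetime import datetime

# Hypothetical (before_call, after_call) pairs captured for two rows.
rows = [
    (datetime(2008, 5, 27, 12, 0, 0, 100000), datetime(2008, 5, 27, 12, 0, 0, 350000)),
    (datetime(2008, 5, 27, 12, 0, 1, 0), datetime(2008, 5, 27, 12, 0, 1, 450000)),
]

latencies = [(after - before).total_seconds() for before, after in rows]
average_latency = sum(latencies) / len(latencies)
print(f"average latency: {average_latency:.3f}s")  # 0.350s for this sample
```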

Last updated: 27-May-08 16:45


Data Migration Principles

Challenge

A successful data migration effort is often critical to a system implementation. These implementations can be a new or upgraded ERP package, integration due to merger and acquisition activity, or the development of a new operational system. The effort and criticality of the data migration as part of the larger system implementation project is often overlooked, underestimated, or given a lower priority in the scope of the full implementation project. As a result, implementations are often delayed at great cost to the organization while data migration issues are addressed. Informatica's suite of products provides functionality and process to minimize the cost of the migration, lower risk, and increase the probability of success (i.e., completing the project on time and on budget). This Best Practice discusses basic principles for data migration that lower project time, staff development time, risk, and the total cost of ownership of the project. These principles include:

1. Leverage staging strategies
2. Utilize table-driven approaches
3. Develop via modular design
4. Focus on re-use
5. Common exception handling processes
6. Multiple simple processes versus few complex processes
7. Take advantage of metadata

Description

Leverage Staging Strategies

As discussed elsewhere in Velocity, in data migration it is recommended to employ both a legacy staging area and a pre-load staging area. The reason for this is simple: it provides the ability to pull data from the production system and use it for data cleansing and harmonization activities without interfering with the production systems. By leveraging this type of strategy you are able to see real production data sooner and follow the guiding principle of 'Convert Early, Convert Often, and with Real Production Data'.

Utilize Table-Driven Approaches

Developers frequently find themselves in positions where they need to perform a large amount of cross-referencing, hard-coding of values, or other repeatable transformations during a data migration. These transformations often change over time. Without a table-driven approach this causes code changes, bug fixes, re-testing, and re-deployments during the development effort. This work is unnecessary on many occasions and can be avoided with the use of configuration or reference data tables. It is recommended to use table-driven approaches such as these whenever possible. Some common table-driven approaches include:

- Default values. Hard-coded values for a given column, stored in a table where the values can be changed whenever a requirement changes. For example, if you have a hard-coded value of NA for any value not populated and then want to change that value to NV, you can simply change the value in a default value table rather than change numerous hard-coded values.
- Cross-reference values. Frequently in data migration projects there is a need to take values from the source system and convert them to the values of the target system. These values are usually identified up front, but as the source system changes, additional values are also needed. In a typical mapping development situation this would require adding additional values to a series of IIF or DECODE statements. With a table-driven solution, new data can be added to a cross-reference table and no coding, testing, or deployment is required.
- Parameter values. By using a table-driven parameter file you can reduce the need for scripting and accelerate the development process.
- Code-driven table. In some instances a set of understood rules is known. By taking those rules and building code against them, a table-driven/code solution can be very productive. For example, if you had a rules table keyed by table/column/rule ID, then whenever that combination was found a pre-set piece of code would be executed. If at a later date the rules change to a different set of pre-determined rules, the rule table can change for the column and no additional coding is required.
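The default-value and cross-reference patterns above can be sketched minimally as follows; plain dicts stand in for the reference tables, and the column names and values are invented for illustration.

```python
# Default values and cross-reference values held as data, not code: changing
# a rule means updating a table row, with no mapping change or redeployment.
DEFAULTS = {"COUNTRY": "NA"}  # change "NA" to "NV" here -- no code change
XREF = {
    ("GENDER", "M"): "Male",
    ("GENDER", "F"): "Female",
}

def transform(column: str, value: str) -> str:
    if value is None or value == "":
        return DEFAULTS.get(column, value)   # default-value table
    return XREF.get((column, value), value)  # cross-reference table

print(transform("COUNTRY", ""))   # NA (from the default table)
print(transform("GENDER", "M"))   # Male (cross-referenced)
```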

Develop Via Modular Design

As part of the migration methodology, modular design is encouraged. Modular design is the act of developing a standard for how similar mappings should function. These standards are then published as templates, and developers are required to build similar mappings in the same manner. This provides rapid development, increases efficiency in testing, and increases ease of maintenance. The result is a dramatically lower total cost of ownership and reduced cost.

Focus on Re-Use

Re-use should always be considered during Informatica development. However, due to the high degree of repeatability on data migration projects, re-use is paramount to success. There is often tremendous opportunity for re-use of mappings, strategies, processes, scripts, and testing documents. This reduces the staff time for migration projects and lowers project costs.

Common Exception Handling Processes

Employing the Velocity data migration methodology, with its iterative intent, adds new data quality rules as problems are found in the data. Because of this, it is critical to find data exceptions and write appropriate rules to correct these situations throughout the data migration effort. It is highly recommended to build a common method for capturing and recording these exceptions, and then to deploy that common method for all data migration processes.
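A common capture routine might look like the sketch below; the field names and the in-memory list (standing in for a shared exception table) are illustrative assumptions, not a prescribed schema.

```python
# Every migration process records rejects through the same routine, so all
# exceptions land in one consistent structure and can be reported together.
from datetime import datetime

EXCEPTIONS = []  # stands in for a shared exception table

def log_exception(interface: str, row_key, rule: str, detail: str) -> None:
    EXCEPTIONS.append({
        "interface": interface,  # which migration interface raised it
        "row_key": row_key,      # source row identifier
        "rule": rule,            # data quality rule that failed
        "detail": detail,
        "logged_at": datetime.now().isoformat(timespec="seconds"),
    })

log_exception("CUSTOMER_MASTER", 1042, "NOT_NULL", "missing postal code")
print(len(EXCEPTIONS), "exception(s) captured")
```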

Multiple Simple Processes versus Few Complex Processes

For data migration projects it is possible to build one process to pull all data for a given entity from all systems to the target system. While this may seem ideal, these types of complex processes take much longer to design and develop, are challenging to test, and are very difficult to maintain over time. Due to these drawbacks, it is recommended to develop as many simple processes as needed to complete the effort rather than a few complex processes.

Take Advantage of Metadata

The Informatica data integration platform is highly metadata driven. Take advantage of those capabilities on data migration projects. This can be done via a host of reports against the data integration repository, such as reports that:

1. Illustrate how the data is being transformed (i.e., lineage reports)
2. Illustrate who has access to what data (i.e., security group reports)
3. Illustrate what source or target objects exist in the repository
4. Identify how many mappings each developer has created
5. Identify how many sessions each developer has run during a given time period
6. Identify how many successful/failed sessions have been executed

In summary, these design principles provide significant benefits to data migration projects and add to the large set of typical best practice items that are available in Velocity. The key to data migration projects is: architect well, design better, and execute best.
Last updated: 01-Feb-07 18:52


Data Migration Project Challenges

Challenge

A successful data migration effort is often critical to a system implementation. These implementations can be a new or upgraded ERP package, integration due to merger and acquisition activity, or the development of a new operational system. The effort and criticality of the data migration as part of the larger system implementation project is often overlooked, underestimated, or given a lower priority in the scope of the full implementation project. As a result, implementations are often delayed at great cost to the organization while data migration issues are addressed. Informatica's suite of products provides functionality and process to minimize the cost of the migration, lower risk, and increase the probability of success (i.e., completing the project on time and on budget). This Best Practice discusses the three main data migration project challenges:

1. Specifications that are incomplete, inaccurate, or not completed on time.
2. Data quality problems impacting project timelines.
3. Difficulties in project management when executing the data migration project.

Description

Unlike other Velocity Best Practices, this one does not specify a full solution to each challenge. Rather, it is more important to understand these three challenges and take action to address them throughout the implementation.

Migration Specifications

A challenge that data migration projects always encounter is problems with migration specifications. Projects require the completion of functional specifications to identify what is required of each migration interface.

Definitions:

- A migration interface is defined as one or more mappings/sessions/workflows or scripts used to migrate a data entity from one source system to one target system.
- A Functional Requirements Specification is normally comprised of a document covering details including security, database join needs, audit needs, and primary contact details. These details are normally at the interface level rather than at the column level. It also includes a Target-Source Matrix, which identifies details at the column level, such as how source tables/columns map to target tables/columns, business rules, data cleansing rules, validation rules, and other column-level specifics.

Many projects attempt to complete these migrations without these types of specifications. Such projects have little to no chance of completing on time or on budget. Time and subject matter expertise are needed to complete this analysis; it is the baseline for project success. Projects are disadvantaged when functional specifications are not completed on time. Developers can be in a wait mode for extended periods when these specifications are not completed by the time specified in the project plan. Another project risk occurs when the right individuals are not used to write these specifications, or when inappropriate levels of importance are applied to this exercise. These situations cause inaccurate or incomplete specifications, which prevent data integration developers from successfully building the migration processes. To address the specification challenge, migration projects must have specifications that are completed with accuracy and delivered on time.

Data Quality
Most projects are affected by data quality due to the need to address problems in the source data that fit into the six dimensions of data quality:

Completeness - What data is missing or unusable?
Conformity   - What data is stored in a non-standard format?
Consistency  - What data values give conflicting information?
Accuracy     - What data is incorrect or out of date?
Duplicates   - What data records or attributes are repeated?
Integrity    - What data is missing or not referenced?

Data migration data quality problems are typically worse than planned for. Projects need to allow enough time to identify and fix data quality problems BEFORE loading the data into the new target system. Informatica's data integration platform provides data quality capabilities that can help to identify data quality problems efficiently, but subject-matter experts are required to determine how these data problems should be addressed within the business context and process.

Project Management

Project managers are often at a disadvantage on these types of projects, which are typically much larger, more expensive, and more complex than any prior project they have been involved with. They need to understand early in the project the importance of correctly completed specs and of addressing data quality, and they need to establish a set of tools to plan the project accurately and objectively, with the ability to evaluate progress. Informatica's Velocity Migration Methodology, its tool sets, and its metadata reporting capabilities are key to addressing these project challenges. The key is to fully understand the pitfalls early in the project, how PowerCenter and Informatica Data Quality can address these challenges, and how metadata reporting can provide objective information on project status.

In summary, data migration projects are challenged by specification issues, data quality issues, and project management difficulties. By understanding the Velocity Methodology's focus on data migration and how Informatica's products can address these challenges, a project can minimize them and deliver a successful migration.

Last updated: 01-Feb-07 18:52


Data Migration Velocity Approach

Challenge


A successful data migration effort is often critical to a system implementation. These implementations can be a new or upgraded ERP package, integration due to merger and acquisition activity, or the development of a new operational system. The effort and criticality of the data migration within the larger implementation project is often overlooked, underestimated, or given a lower priority in the scope of the full implementation project. As a result, implementations are often delayed at great cost to the organization while data migration issues are addressed. Informatica's suite of products provides functionality and process to minimize the cost of the migration, lower risk, and increase the probability of success (i.e., completing the project on time and on budget). To meet these objectives, a set of best practices focused on data migration has been provided in Velocity. This Best Practice provides an overview of how to use Informatica's products in an iterative methodology to expedite a data migration project. The keys to the methodology are discussed further in the Best Practice Data Migration Principles.

Description
The Velocity approach to data migration is illustrated here. While it is possible to migrate data in one step, it is more productive to break the process into two or three simpler steps. The goal for data migration is to get the data into the target application as early as possible for large-scale implementations. Typical implementations will have three to four trial cutovers or mock runs before the final implementation, or go-live. The mantra for an Informatica-based migration is to Convert Early, Convert Often, and Convert with Real Production Data. To do this, the following approach is encouraged:

Analysis
In the analysis phase, the specifications are completed; these include both the functional specs and the target-source matrix. See the Best Practice Data Migration Project Challenges for related information.

Acquire

In the acquire phase, the target-source matrix is reviewed and all source systems/tables are identified. These tables are used to develop one mapping per source table to populate a mirrored structure in a legacy database schema. For example, if 50 source tables were identified across all the Target-Source Matrix documents, 50 legacy tables would be created and 50 mappings would be developed, one for each table. It is recommended to perform the initial development against test data but, once complete, to run a single extract of the current production data. This assists in addressing data quality problems without impacting production systems. It is recommended to run these extracts in low-use time periods and with the cooperation of the operations group responsible for these systems. It is also recommended to take advantage of the Visio Generation Option if available. These mappings are very straightforward, and the use of auto-generation can increase consistency and lower the staff time required for the project.

Convert
In this phase data will be extracted from the legacy stage tables (merged, transformed, and cleansed) to populate a mirror of the target application. As part of this process a standard exception process should be developed to determine exceptions and expedite data cleansing activities. The results of this convert process should be profiled, and appropriate data quality scorecards should be reviewed. During the convert phase the basic set of exception tests should be executed, with exception details collected for future reporting and correction. The basic exception tests include:

1. Data Type
2. Data Size
3. Data Length
4. Valid Values
5. Range of Values

Data Type - Will the source data value load correctly to the target data type, such as a numeric date loading into an Oracle date type?

Data Size - Will a numeric value from the source load correctly to the target column, or will a numeric overflow occur?

Data Length - Is the input value too large for the target column? (This is appropriate for all data types but of particular interest for string data types. For example, in one system a field could be char(256) but most of the values are char(10). In the target the new field is varchar(20), so any value over char(20) should raise an exception.)

Range of Values - Is the input value within a tolerable range for the new system? (For example, does the birth date for an insurance subscriber fall between Jan 1, 1900 and Jan 1, 2006? If this test fails, the date is unreasonable and should be addressed.)

Valid Values - Is the input value in the list of accepted values in the target system? (For example, does the state code for an input record match the list of states in the new target system? If not, the data should be corrected prior to entry into the new system.)
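The five basic exception tests can be sketched as simple row-level checks. The function below is a hypothetical illustration (the field names, target metadata, and rules are assumptions); in practice these checks would be implemented as PowerCenter transformations or data quality plans:

```python
from datetime import date

# Illustrative target metadata: accepted state codes for the Valid Values test.
VALID_STATES = {"CA", "NY", "TX"}

def exception_tests(row):
    """Return a list of (test, message) exceptions for one source row."""
    exceptions = []
    # 1. Data Type: will the value load into a numeric target type?
    try:
        qty = int(row["qty"])
    except (ValueError, TypeError):
        exceptions.append(("Data Type", "qty is not numeric"))
        qty = None
    # 2. Data Size: numeric overflow against an assumed NUMBER(10) target.
    if qty is not None and abs(qty) >= 10**10:
        exceptions.append(("Data Size", "qty overflows NUMBER(10)"))
    # 3. Data Length: string too long for an assumed varchar(20) target.
    if len(row["name"]) > 20:
        exceptions.append(("Data Length", "name exceeds varchar(20)"))
    # 4. Valid Values: state code must exist in the target's list.
    if row["state"] not in VALID_STATES:
        exceptions.append(("Valid Values", f"unknown state {row['state']!r}"))
    # 5. Range of Values: birth date must fall in a reasonable window.
    if not (date(1900, 1, 1) <= row["birth_date"] <= date(2006, 1, 1)):
        exceptions.append(("Range of Values", "birth_date out of range"))
    return exceptions

row = {"qty": "12x", "name": "A" * 25, "state": "ZZ", "birth_date": date(1890, 5, 1)}
for test, msg in exception_tests(row):
    print(test, "-", msg)
```

Collecting the results as (test, message) pairs, rather than rejecting rows outright, supports the exception reporting and correction loop described above.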

Once the profiling exercises, exception reports, and data quality scorecards are complete, a list of data quality issues should be created. This list should then be reviewed with the functional business owners to generate new data quality rules to correct the data. These details should be added to the spec, and the original convert process should be modified with the new data quality rules. The convert process should then be re-executed, along with the profiling, exception reporting, and data scorecarding, until the data is correct and ready for load to the target application.

Migrate
In the migrate phase, the data from the convert phase is loaded to the target application. The expectation is that there should be no failures on these loads: the data should have been corrected in the convert phase, prior to loading the target application. Once the migrate phase is complete, validation should occur. It is recommended to complete an audit/balancing step prior to validation; this is discussed in the Best Practice Build Data Audit/Balancing Processes. Additional detail about these steps is provided in the Best Practice Data Migration Principles.
Last updated: 06-Feb-07 12:08


Build Data Audit/Balancing Processes

Challenge


Data migration and data integration projects are often challenged to verify that the data in an application is complete; more specifically, to verify that all the appropriate data was extracted from a source system and propagated to its final target. This Best Practice illustrates how to do this in an efficient and repeatable fashion for increased productivity and reliability. This is particularly important in businesses that are highly regulated, internally or externally, or that must comply with a host of government regulations such as Sarbanes-Oxley, Basel II, HIPAA, the Patriot Act, and many others.

Description
The common practice for audit and balancing solutions is to produce a set of common tables that can hold various control metrics regarding the data integration process. Ultimately, business intelligence reports provide insight at a glance to verify that the correct data has been pulled from the source and completely loaded to the target. Each control measure being tracked requires development of a corresponding PowerCenter process to load the metrics to the Audit/Balancing Detail table. To drive out this type of solution, execute the following tasks:

1. Work with business users to identify what audit/balancing processes are needed. Some examples:
   a. Customers (number of customers, or number of customers by country)
   b. Orders (quantity of units sold, or net sales amount)
   c. Deliveries (number of shipments, quantity of units shipped, or value of all shipments)
   d. Accounts Receivable (number of accounts receivable shipments, or total accounts receivable outstanding)
2. For each process defined in step 1, define which columns should be used for tracking purposes in both the source and target systems.
3. Develop a data integration process that reads from the source system and populates the detail audit/balancing table with the control totals.
4. Develop a data integration process that reads from the target system and populates the detail audit/balancing table with the control totals.
5. Develop a reporting mechanism that queries the audit/balancing table and identifies whether the source and target entries match or there is a discrepancy.

An example audit/balancing table definition looks like this:

Audit/Balancing Details


Column Name         Data Type      Size
AUDIT_KEY           NUMBER         10
CONTROL_AREA        VARCHAR2       50
CONTROL_SUB_AREA    VARCHAR2       50
CONTROL_COUNT_1     NUMBER         10
CONTROL_COUNT_2     NUMBER         10
CONTROL_COUNT_3     NUMBER         10
CONTROL_COUNT_4     NUMBER         10
CONTROL_COUNT_5     NUMBER         10
CONTROL_SUM_1       NUMBER(p,s)    10,2
CONTROL_SUM_2       NUMBER(p,s)    10,2
CONTROL_SUM_3       NUMBER(p,s)    10,2
CONTROL_SUM_4       NUMBER(p,s)    10,2
CONTROL_SUM_5       NUMBER(p,s)    10,2
UPDATE_TIMESTAMP    TIMESTAMP
UPDATE_PROCESS      VARCHAR2       50

Control Column Definition by Control Area/Control Sub Area

Column Name         Data Type      Size
CONTROL_AREA        VARCHAR2       50
CONTROL_SUB_AREA    VARCHAR2       50
CONTROL_COUNT_1     VARCHAR2       50
CONTROL_COUNT_2     VARCHAR2       50
CONTROL_COUNT_3     VARCHAR2       50
CONTROL_COUNT_4     VARCHAR2       50
CONTROL_COUNT_5     VARCHAR2       50
CONTROL_SUM_1       VARCHAR2       50
CONTROL_SUM_2       VARCHAR2       50
CONTROL_SUM_3       VARCHAR2       50
CONTROL_SUM_4       VARCHAR2       50
CONTROL_SUM_5       VARCHAR2       50
UPDATE_TIMESTAMP    TIMESTAMP
UPDATE_PROCESS      VARCHAR2       50

The following is a screenshot of a single mapping that populates both the source and target control values:


The following two screenshots show how two mappings could be used to provide the same results:


Note: One key challenge is how to capture the appropriate control values from the source system if it is continually being updated. The first example, with one mapping, will not work due to the changes that occur between the extraction of the data from the source and the completion of the load to the target application. In those cases, you may want to take advantage of an Aggregator transformation to collect the appropriate control totals, as illustrated in this screenshot:

The following is a straw-man example of an audit/balancing report, which is the end result of this type of process:

Data Area    Leg count   TT count   Diff   Leg amt     TT amt      Diff
Customer     11000       10099      1      11230.21    11230.21    0
Orders       9827        9827       0      21294.22    21011.21    283.01
Deliveries   1298        1288       0
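The balancing comparison behind such a report can be sketched as follows. This is a hypothetical illustration comparing legacy and target control totals held per control area (an actual solution would query the audit/balancing detail table loaded by the PowerCenter processes, and the figures here are made up):

```python
# Control totals keyed by control area: (record count, amount).
legacy = {"Customer": (11000, 11230.21), "Orders": (9827, 21294.22)}
target = {"Customer": (10999, 11230.21), "Orders": (9827, 21011.21)}

def balance_report(legacy, target):
    """Compare legacy vs. target control totals and flag discrepancies."""
    rows = []
    for area in sorted(set(legacy) | set(target)):
        l_count, l_amt = legacy.get(area, (0, 0.0))
        t_count, t_amt = target.get(area, (0, 0.0))
        rows.append({
            "area": area,
            "count_diff": l_count - t_count,
            "amt_diff": round(l_amt - t_amt, 2),
            # In balance only when counts match and amounts agree to the cent.
            "in_balance": l_count == t_count and abs(l_amt - t_amt) < 0.005,
        })
    return rows

for r in balance_report(legacy, target):
    print(r)
```

Iterating over the union of control areas ensures that an area present in only one system still surfaces as a discrepancy rather than silently disappearing from the report.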

In summary, there are two big challenges in building audit/balancing processes:

1. Identifying what the control totals should be
2. Building processes that collect the correct information at the correct granularity

There is also a set of basic tasks that can be leveraged and shared across any audit/balancing needs. By building a common model for meeting audit/balancing needs, projects can lower the time needed to develop these solutions while still reducing risk by having this type of solution in place.


Continuing Nature of Data Quality

Challenge


A data quality (DQ) project usually begins with a specific use case in mind, such as resolving data quality issues as part of a data migration effort or reconciling data acquired through a merger or acquisition. Regardless of the specific data quality need, a data quality project should be planned as an iterative process. Because change is constant, data quality should never be considered an absolute. An organization must be cognizant of the continuing nature of data quality whenever undertaking a project that involves it. The goal of this Best Practice is to set forth principles that outline the iterative nature of data quality and the steps to consider when planning a data quality initiative. Experience has shown that applying these principles and steps maximizes the potential for ongoing success in data quality projects.

Description
The case for treating data quality as an iterative process stems from two core concepts. First, the level of sophistication around data quality will continue to improve as a DQ process is implemented. Specifically, as the results are disseminated throughout the organization, it becomes easier to make decisions on the types of rules and standards that should be implemented, since everyone is working from a single view of the truth. Although everyone may not agree on how data is being entered or identified, the baseline analysis identifies the standards (or lack thereof) currently in place and provides a starting point to work from. Once the initial data quality process is implemented, the iterative nature begins. Users become more familiar with the data as they review the results of the data quality plans to standardize, cleanse, and de-duplicate it. As each iteration continues, the data stewards should determine whether the business rules and reference dictionaries initially put into place need to be modified to effectively address any new issues that arise.

The second reason that data quality continues to evolve is the premise that the data will not remain static. Although a baseline set of data quality rules will eventually be agreed upon, the assumption is that as soon as legacy data has been cleansed, standardized, and de-duplicated, it will change again. This change could come from a user updating a record or from a new data source being introduced that ultimately needs to become part of the master data. In either case, additional iterations on the updated records and/or new sources should be considered. The frequency of these iterations will vary and is ultimately driven by the processes for data entry and manipulation within an organization. This can result in anything from a need to cleanse data in real time to a nightly or weekly batch process.

Regardless, scorecards should be monitored to determine whether the business rules initially implemented need to be modified or whether they continue to meet the organization's data quality needs. The questions that should be considered when evaluating the continuing and iterative nature of data quality include:
- Are the business rules and reference dictionaries meeting the needs of the organization when attempting to report on the underlying data?
- If a new data source is introduced, can the same data quality rules be applied, or do new rules need to be developed for the type of data found in this new source?
- From a trend perspective, is the quality of data improving over time? If not, what needs to be done to remedy the situation?

The answers to these questions provide a framework for measuring the current level of success achieved in implementing an iterative data quality initiative. Just as data quality should be viewed as iterative, so should these questions: reflect on them frequently to determine whether changes are needed in how data quality is implemented within the environment, or in the underlying business rules within a specific DQ process. Although the reasons to iterate through the data may vary, the following steps are prevalent in each iteration:

1. Identify the problematic data elements that need to be addressed. Problematic data could include bad addresses, duplicate records, or incomplete data elements, among other examples.
2. Define the data quality rules and targets that need to be resolved. This includes rules for specific sources and content around which data quality areas are being addressed.
3. Design data quality plans to correct the problematic data. This could be one or many data quality plans, depending upon the scope and complexity of the source data.
4. Implement quality improvement processes to identify problematic data on an ongoing basis. These processes should detect data anomalies that could lead to known and unknown data problems.
5. Monitor and repeat. This ensures that the data quality plans correct the data to the desired thresholds. Since data quality definitions can be adjusted based on business and data factors, this iterative review is essential to ensure that stakeholders understand what will change as the data is cleansed and how the cleansed data may affect existing business processes and management reporting.

Example of the Iterative Process


As noted in the above diagram, the iterative data quality process will continue to be leveraged within an organization as new master data is introduced. By having defined processes in place up front, the ability to effectively leverage the data quality solution is enhanced. An organization's departments that are charged with implementing and monitoring data quality will be doing so within the confines of the enterprise-wide rules and procedures identified for the organization. The following points should be considered as an expansion of the five steps noted above:

1. Identify and Measure Data Quality: This first point is key. The ability to understand the data within the confines of the six dimensions of data quality forms the foundation for the business rules and processes that will be put in place. Without performing an upfront assessment, the ability to effectively implement a data quality strategy will be negatively impacted. From an ongoing perspective, the data quality assessment allows an organization to see how the data quality procedures put into place have improved the quality of the data. Additionally, as new data enters the organization, the assessment provides key information for making ongoing modifications to the data quality processes.
2. Define Data Quality Rules and Targets: Once the assessment is complete, the second part of the analysis phase involves scorecarding the results in order to put into place success criteria and metrics for the data quality management initiative. From an ongoing perspective, this phase involves performing trend analysis on the data and the rules in place to ensure the data continues to conform to the rules established during the data quality management initiative.
3. Design Quality Improvement Processes: This phase involves the manipulation of the data to align it with the business rules put into place. Examples of potential improvements include standardization, removing noise, aligning product attributes, and implementing measures or classifications.
4. Implement Quality Improvement Processes: Once the data has been standardized, an adjunct to the enhancement process involves identifying duplicate data and taking action based upon the business rules that have been identified. The rules to identify and address duplicate data will continue to evolve. This evolution occurs as data stewards become more familiar with the data and as the policies and procedures set in place by the data governance committee become widely adopted throughout the organization. As this occurs, the ability to find additional duplicates or new relationships within the data begins to arise.
5. Monitor Data Quality versus Targets: The ability to monitor the data quality processes is critical, as it provides the organization with a quick snapshot of the health of the data. Through analysis of the scorecard results, the data governance committee will have the information needed to confidently make additional modifications to the data quality strategies in place, if needed. Conversely, the scorecards and trend analysis results provide peace of mind that data quality is being effectively addressed within the organization.
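Step 5, monitoring against targets, can be sketched as a simple scorecard comparison. The dimension names, scores, and thresholds below are illustrative assumptions, standing in for the output of a real scorecarding tool:

```python
# Agreed target thresholds (percent) per data quality dimension.
targets = {"completeness": 98.0, "conformity": 95.0, "duplicates": 99.0}

def scorecard(measured, targets):
    """Flag each dimension as PASS/FAIL against its target threshold."""
    return {dim: {"score": measured[dim],
                  "target": tgt,
                  "status": "PASS" if measured[dim] >= tgt else "FAIL"}
            for dim, tgt in targets.items()}

# Measured scores from the latest cleansing run (illustrative).
this_run = {"completeness": 99.1, "conformity": 93.4, "duplicates": 99.6}

for dim, result in scorecard(this_run, targets).items():
    print(f"{dim:13s} {result['score']:5.1f} vs {result['target']:5.1f}  {result['status']}")
```

Retaining one such scorecard per run is what enables the trend analysis described above: the data governance committee can see whether each dimension is converging toward, or drifting away from, its target.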

Last updated: 20-May-08 22:53


Data Cleansing

Challenge


Poor data quality is one of the biggest obstacles to the success of many data integration projects. A 2005 study by the Gartner Group stated that the majority of currently planned data warehouse projects would suffer limited acceptance or fail outright, and declared that the main cause of project problems was a lack of attention to data quality. Moreover, once in the system, poor data quality can cost organizations vast sums in lost revenue. Defective data leads to breakdowns in the supply chain, poor business decisions, and inferior customer relationship management. It is essential that data quality issues be tackled during any large-scale data project to enable project success and future organizational success. The challenge is therefore twofold: to cleanse project data so that the project succeeds, and to ensure that all data entering the organizational data stores provides for consistent and reliable decision-making.

Description
A significant portion of time in the project development process should be dedicated to data quality, including the implementation of data cleansing processes. In a production environment, data quality reports should be generated after each data warehouse implementation or when new source systems are integrated into the environment. There should also be provision for rolling back if data quality testing indicates that the data is unacceptable. Informatica offers two application suites for tackling data quality issues: Informatica Data Explorer (IDE) and Informatica Data Quality (IDQ). IDE focuses on data profiling, and its results can feed into the data integration process. However, its unique strength is its metadata profiling and discovery capability. IDQ has been developed as a data analysis, cleansing, correction, and de-duplication tool, one that provides a complete solution for identifying and resolving all types of data quality problems and preparing data for the consolidation and load processes.

Concepts
Following are some key concepts in the field of data quality. These concepts provide a foundation that helps to develop a clear picture of the subject data, which can improve both efficiency and effectiveness. The list of concepts can be read as a process, leading from profiling and analysis through to consolidation.

Profiling and Analysis - although data profiling and data analysis are often synonymous terms, in Informatica terminology these tasks are assigned to IDE and IDQ respectively. Thus, profiling is primarily concerned with metadata discovery and definition, and IDE is ideally suited to these tasks. IDQ can discover data quality issues at a record and field level, and Velocity best practices recommend the use of IDQ for such purposes. Note: the remaining items in this document therefore focus on IDQ usage.


Parsing - the process of extracting individual elements within records, files, or data entry forms in order to check the structure and content of each field and to create discrete fields devoted to specific information types. Examples include: name, title, company name, phone number, and SSN.

Cleansing and Standardization - arranging information in a consistent manner or preferred format. Examples include the removal of dashes from phone numbers or SSNs. For more information, see the Best Practice Effective Data Standardizing Techniques.

Enhancement - adding useful, but optional, information to existing data, or completing data. Examples include: sales volume, number of employees for a given business, and zip+4 codes.

Validation - the process of correcting data using algorithmic components and secondary reference data sources to check and validate information. Example: validating addresses against postal directories.

Matching and De-duplication - removing, or flagging for removal, redundant or poor-quality records where high-quality records of the same information exist. Use matching components and business rules to identify records that may refer, for example, to the same customer. For more information, see the Best Practice Effective Data Matching Techniques.

Consolidation - using the data sets defined during the matching process to combine all cleansed or approved data into a single, consolidated view. Examples include building a best record, a master record, or house-holding.
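As a minimal illustration of the standardization concept (not IDQ's actual implementation), the snippet below applies the kind of rule a cleansing plan would encode: reducing phone numbers and SSNs to a single preferred format. The formats chosen are assumptions for the example:

```python
import re

def standardize_phone(raw):
    """Reduce a phone number to digits, then format as NNN-NNN-NNNN if possible."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 10:
        return f"{digits[0:3]}-{digits[3:6]}-{digits[6:]}"
    return digits or None  # leave non-conforming values for exception handling

def standardize_ssn(raw):
    """Strip dashes and spaces from an SSN, keeping only its nine digits."""
    digits = re.sub(r"\D", "", raw)
    return digits if len(digits) == 9 else None

print(standardize_phone("(415) 555-0123"))  # -> 415-555-0123
print(standardize_ssn("123-45-6789"))       # -> 123456789
```

Note that values that cannot be standardized are returned as None rather than guessed at, so they can be routed to an exception process instead of silently entering the target.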

Informatica Applications
The Informatica Data Quality software suite has been developed to resolve a wide range of data quality issues, including data cleansing. The suite comprises the following elements:
- IDQ Workbench - a stand-alone desktop tool that provides a complete set of data quality functionality on a single computer (Windows only).


- IDQ Server - a set of processes that enables the deployment and management of data quality procedures and resources across a network of any size through TCP/IP.
- IDQ Integration - a plug-in component that integrates Workbench with PowerCenter, enabling PowerCenter users to embed data quality procedures defined in IDQ in their mappings.

IDQ stores all its processes as XML in the Data Quality repository (MySQL). IDQ Server enables the creation and management of multiple repositories.

Using IDQ in Data Projects


IDQ can be used effectively alongside PowerCenter in data projects, to run data quality procedures in its own applications or to provide them for addition to PowerCenter transformations. Through its Workbench user-interface tool, IDQ tackles data quality in a modular fashion. That is, Workbench enables you to build discrete procedures (called plans in Workbench) which contain data input components, output components, and operational components. Plans can perform analysis, parsing, standardization, enhancement, validation, matching, and consolidation operations on the specified data. Plans are saved into projects that can provide a structure and sequence to your data quality endeavors. The following figure illustrates how data quality processes can function in a project setting:


In stage 1, you analyze the quality of the project data according to several metrics, in consultation with the business or project sponsor. This stage is performed in Workbench, which enables the creation of versatile and easy-to-use dashboards to communicate data quality metrics to all interested parties.

In stage 2, you verify the target levels of quality for the business according to the data quality measurements taken in stage 1, and in accordance with project resourcing and scheduling.

In stage 3, you use Workbench to design the data quality plans and projects to achieve the targets. Capturing business rules and testing the plans are also covered in this stage.

In stage 4, you deploy the data quality plans. If you are using IDQ Workbench and Server, you can deploy plans and resources to remote repositories and file systems through the user interface. If you are running Workbench alone on remote computers, you can export your plans as XML. Stage 4 is the phase in which data cleansing and other data quality tasks are performed on the project data.

In stage 5, you test and measure the results of the plans and compare them to the initial data quality assessment to verify that targets have been met. If targets have not been met, this information feeds into another iteration of data quality operations in which the plans are tuned and optimized.

In a large data project, you may find that data quality processes of varying sizes and impact are necessary at many points in the project plan. At a high level, stages 1 and 2 ideally occur very early in the project, at a point defined as the Manage Phase within Velocity. Stages 3 and 4 typically occur during the Design Phase of Velocity. Stage 5 can occur during the Design and/or Build Phase of Velocity, depending on the level of unit testing required.

Using the IDQ Integration


Data Quality Integration is a plug-in component that enables PowerCenter to connect to the Data Quality repository and import data quality plans to a PowerCenter transformation. With the Integration component, you can apply IDQ plans to your data without necessarily interacting with or being aware of IDQ Workbench or Server. The Integration interacts with PowerCenter in two ways:
- On the PowerCenter client side, it enables you to browse the Data Quality repository and add data quality plans to custom transformations. The data quality plan's functional details are saved as XML in the PowerCenter repository.
- On the PowerCenter server side, it enables the PowerCenter Server (or Integration Service) to send data quality plan XML to the Data Quality engine for execution.

The Integration requires that at least the following IDQ components are available to PowerCenter:
- Client side: PowerCenter needs access to a Data Quality repository from which to import plans.
- Server side: PowerCenter needs an instance of the Data Quality engine to execute the plan instructions.

An IDQ-trained consultant can build the data quality plans, or you can use the pre-built plans provided by Informatica. Currently, Informatica provides a set of plans dedicated to cleansing and de-duplicating North American name and postal address records. The Integration component enables the following process:
- Data quality plans are built in Data Quality Workbench and saved from there to the Data Quality repository.
- The PowerCenter Designer user opens a Data Quality Integration transformation and configures it to read from the Data Quality repository. Next, the user selects a plan from the Data Quality repository and adds it to the transformation.
- The PowerCenter Designer user saves the transformation and the mapping containing it to the PowerCenter repository. The plan information is saved with the transformation as XML.

The PowerCenter Integration service can then run a workflow containing the saved mapping. The relevant source data and plan information will be sent to the Data Quality engine, which processes the data (in conjunction with any reference data files used by the plan) and returns the results to PowerCenter.

Last updated: 06-Feb-07 12:43


Data Profiling

Challenge
Data profiling is an option in PowerCenter version 7.0 and later that leverages existing PowerCenter functionality and a data profiling GUI front-end to provide a wizard-driven approach to creating data profiling mappings, sessions, and workflows. This Best Practice is intended to provide an introduction for new users. Bear in mind that Informatica's Data Quality (IDQ) applications also provide data profiling capabilities. Consult the following Velocity Best Practice documents for more information:
- Data Cleansing
- Using Data Explorer for Data Discovery and Analysis

Description

The data profiling option provides visibility into the data contained in source systems and enables users to measure changes in the source data over time. This information can help to improve the quality of the source data.

Creating a Custom or Auto Profile

An auto profile is particularly valuable when you are profiling a source for the first time, since auto profiling offers a good overall perspective of a source. It provides a row count, candidate key evaluation, and redundancy evaluation at the source level, and domain inference, distinct value and null value counts, and min, max, and average (if numeric) at the column level. Creating and running an auto profile is quick and helps you gain a reasonably thorough understanding of a source in a short amount of time.

A custom data profile is useful when there is a specific question about a source, such as a business rule you want to validate or a pattern you want to verify the data matches.
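To make the auto-profile column metrics concrete, here is a minimal Python sketch of what such a profile computes for a single column. This is an illustration only, not PowerCenter's implementation; the function name and the candidate-key rule (all values unique and non-null) are assumptions.

```python
def profile_column(values):
    """Compute auto-profile style metrics for one column: row count,
    null and distinct counts, a candidate-key check, and min/max/average
    for numeric data."""
    non_null = [v for v in values if v is not None]
    numeric = [v for v in non_null if isinstance(v, (int, float))]
    metrics = {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        # Candidate key: every value present, and no duplicates.
        "candidate_key": len(set(non_null)) == len(values),
    }
    if numeric:
        metrics["min"] = min(numeric)
        metrics["max"] = max(numeric)
        metrics["avg"] = sum(numeric) / len(numeric)
    return metrics
```

For example, `profile_column([10, 20, 20, None])` reports one null, two distinct values, and flags the column as unsuitable for a candidate key.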

Setting Up the Profile Wizard


To customize the profile wizard for your preferences:

- Open the Profile Manager and choose Tools > Options.
- If you are profiling data using a database user that is not the owner of the tables to be sourced, check the "Use source owner name during profile mapping generation" option.
- If you are in the analysis phase of your project, choose "Always run profile interactively" since most of your data-profiling tasks will be interactive. (In later phases of the project, uncheck this option because more permanent data profiles are useful in these phases.)

Running and Monitoring Profiles


Profiles are run in one of two modes: interactive or batch. Choose the appropriate mode by checking or unchecking Configure Session on the "Function-Level Operations" tab of the wizard.

- Use Interactive to create quick, single-use data profiles. The sessions are created with default configuration parameters.
- For data-profiling tasks that are likely to be reused on a regular basis, create the sessions manually in Workflow Manager and configure and schedule them appropriately.

Generating and Viewing Profile Reports


Use Profile Manager to view profile reports. Right-click on a profile and choose View Report.

For greater flexibility, you can also use Data Analyzer to view reports. Each PowerCenter client includes a Data Analyzer schema and reports XML file. The XML files are located in the \Extensions\DataProfile\IPAReports subdirectory of the client installation. You can create additional metrics, attributes, and reports in Data Analyzer to meet specific business requirements. You can also schedule Data Analyzer reports and alerts to send notifications in cases where data does not meet preset quality limits.

Sampling Techniques
Four types of sampling techniques are available with the PowerCenter data profiling option:

- No sampling: Uses all source data. Appropriate for relatively small data sources.
- Automatic random sampling: PowerCenter determines the appropriate percentage to sample, then samples random rows. Appropriate for larger data sources where you want a statistically significant data analysis.
- Manual random sampling: PowerCenter samples random rows of the source data based on a user-specified percentage. Use this to sample more or fewer rows than the automatic option chooses.
- Sample first N rows: Samples the number of user-selected rows. Provides a quick readout of a source (e.g., first 200 rows).

Profile Warehouse Administration

Updating Data Profiling Repository Statistics


The Data Profiling repository contains nearly 30 tables with more than 80 indexes. To ensure that queries run optimally, be sure to keep database statistics up to date. Run the query below as appropriate for your database type, then capture the script that is generated and run it.

ORACLE
select 'analyze table ' || table_name || ' compute statistics;'
from user_tables
where table_name like 'PMDP%';

select 'analyze index ' || index_name || ' compute statistics;'
from user_indexes
where index_name like 'DP%';

Microsoft SQL Server


select 'update statistics ' + name from sysobjects where name like 'PMDP%'

SYBASE
select 'update statistics ' + name from sysobjects where name like 'PMDP%'

INFORMIX


select 'update statistics low for table ', tabname, ' ; '
from systables
where tabname like 'PMDP%'

IBM DB2
select 'runstats on table ' || rtrim(tabschema) || '.' || tabname || ' and indexes all;'
from syscat.tables
where tabname like 'PMDP%'

TERADATA
select 'collect statistics on ', tablename, ' index ', indexname
from dbc.indices
where tablename like 'PMDP%' and databasename = 'database_name'

where database_name is the name of the repository database.
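The two-step pattern above (run a query that emits maintenance statements, capture its output, then execute the generated script) can be automated. This sketch uses Python's built-in sqlite3 module as a stand-in driver purely to demonstrate the pattern; for Oracle, DB2, and the other databases listed, you would substitute the appropriate connection and the generator query shown above.

```python
import sqlite3

def refresh_stats(conn, generator_sql):
    """Run the statement-generating query, then execute each generated
    maintenance statement against the same connection."""
    statements = [row[0] for row in conn.execute(generator_sql)]
    for stmt in statements:
        conn.execute(stmt)
    return statements

# Demonstration with SQLite, whose ANALYZE command accepts a table name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pmdp_example (id INTEGER)")
refresh_stats(
    conn,
    "SELECT 'ANALYZE ' || name FROM sqlite_master WHERE type='table'",
)
```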

Purging Old Data Profiles


Use the Profile Manager to purge old profile data from the Profile Warehouse. Choose Target Warehouse>Connect and connect to the profiling warehouse. Choose Target Warehouse>Purge to open the purging tool.

Last updated: 01-Feb-07 18:52


Data Quality Mapping Rules

Challenge


Use PowerCenter to create data quality mapping rules to enhance the usability of the data in your system.

Description
The issue of poor data quality is one that frequently hinders the success of data integration projects. It can produce inconsistent or faulty results and ruin the credibility of the system with the business users. This Best Practice focuses on techniques for use with PowerCenter and third-party or add-on software. Comments that are specific to the use of PowerCenter are enclosed in brackets.

Bear in mind that you can augment or supplant the data quality handling capabilities of PowerCenter with Informatica Data Quality (IDQ), the Informatica application suite dedicated to data quality issues. Data analysis and data enhancement processes, or plans, defined in IDQ can deliver significant data quality improvements to your project data. A data project that has built-in data quality steps, such as those described in the Analyze and Design phases of Velocity, enjoys a significant advantage over a project that has not audited and resolved issues of poor data quality. If you have added these data quality steps to your project, you are likely to avoid the issues described below.

A description of the range of IDQ capabilities is beyond the scope of this document. For a summary of Informatica's data quality methodology, as embodied in IDQ, consult the Best Practice Data Cleansing.

Common Questions to Consider


Data integration/warehousing projects often encounter general data problems that may not merit a full-blown data quality project, but which nonetheless must be addressed. This document discusses some methods to ensure a base level of data quality; much of the content discusses specific strategies to use with PowerCenter. The quality of data is important in all types of projects, whether it be data warehousing,


data synchronization, or data migration. Certain questions need to be considered for all of these projects, with the answers driven by the project's requirements and the business users being serviced. Ideally, these questions should be addressed during the Analyze and Design Phases of the project because they can require a significant amount of re-coding if identified later. Some of the areas to consider are:

Text Formatting
The most common hurdle here is capitalization and trimming of spaces. Often, users want to see data in its raw format without any capitalization, trimming, or formatting applied to it. This is easily achievable as it is the default behavior, but there is danger in taking this requirement literally since it can lead to duplicate records when some of these fields are used to identify uniqueness and the system is combining data from various source systems.

One solution to this issue is to create additional fields that act as a unique key to a given table, but which are formatted in a standard way. Since the raw data is stored in the table, users can still see it in this format, but the additional columns mitigate the risk of duplication. Another possibility is to explain to the users that raw data in unique, identifying fields is not as clean and consistent as data in a common format. In other words, push back on this requirement.

This issue can be particularly troublesome in data migration projects where matching the source data is a high priority. Failing to trim leading/trailing spaces from data can often lead to mismatched results since the spaces are stored as part of the data value. The project team must understand how spaces are handled in the source systems to determine the amount of coding required to correct this. (When using PowerCenter and sourcing flat files, the options provided while configuring the File Properties may be sufficient.)

Remember that certain RDBMS products use the data type CHAR, which stores the data with trailing blanks. These blanks need to be trimmed before matching can occur. It is usually only advisable to use CHAR for 1-character flag fields.
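The "additional standardized key column" approach can be sketched as follows: keep the raw value for display, but derive a trimmed, case-folded key for uniqueness checks. This Python example is illustrative only; the function name and standardization rules are assumptions, not part of PowerCenter.

```python
import re

def match_key(raw: str) -> str:
    """Derive a standardized key from a raw value: trim leading and
    trailing spaces, collapse internal whitespace, and upper-case."""
    return re.sub(r"\s+", " ", raw.strip()).upper()

# Raw values that look different but identify the same entity collapse
# to a single key, while the raw column is stored untouched for users.
records = ["  Acme Corp ", "ACME  CORP", "acme corp"]
keys = {match_key(r) for r in records}  # one key for all three
```

Storing both columns gives users the raw data they asked for while protecting the load process from space- and case-induced duplicates.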


Note that many fixed-width files pad fields with spaces rather than nulls. Therefore, developers must put one space beside the text radio button, and also tell the product that the space is repeating to fill out the rest of the precision of the column. The strip trailing blanks facility then strips off any remaining spaces from the end of the data value.

In PowerCenter, avoid embedding database text manipulation functions in lookup transformations: the resulting SQL override forces the developer to cache the lookup table, and on very large tables caching is not always realistic or feasible.

Datatype Conversions
It is advisable to use explicit tool functions when converting the data type of a particular data value. [In PowerCenter, if the TO_CHAR function is not used, an implicit conversion is performed, and 15 digits are carried forward, even when they are not needed or desired. PowerCenter can handle some conversions without function calls (these are detailed in the product documentation), but this may cause subsequent support or maintenance headaches.]

Dates
Dates can cause many problems when moving and transforming data from one place to another because an assumption must be made that all data values are in a designated format. [Informatica recommends first checking a piece of data to ensure it is in the proper format before trying to convert it to a Date data type. If the check is not performed first, the developer increases the risk of transformation errors, which can cause data to be lost.] An example piece of code would be:

IIF(IS_DATE(in_RECORD_CREATE_DT, 'YYYYMMDD'), TO_DATE(in_RECORD_CREATE_DT, 'YYYYMMDD'), NULL)

If the majority of the dates coming from a source system arrive in the same format, then it is often wise to create a reusable expression that handles dates, so that the proper checks are made.

It is also advisable to determine whether any default dates should be defined, such as a low date or high date. These should then be used throughout the system for consistency. However, do not fall into the trap of always using default dates, as some are meant to be NULL until the appropriate time (e.g., birth date or death date). The NULL in the example above could be changed to one of the standard default dates described here.
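The IIF(IS_DATE(...), TO_DATE(...), NULL) pattern above can be mirrored in general-purpose code. This Python sketch is an analogy, not the PowerCenter expression language; the function name is an assumption.

```python
from datetime import datetime

def safe_to_date(value, fmt: str = "%Y%m%d"):
    """Mirror IIF(IS_DATE(x, fmt), TO_DATE(x, fmt), NULL): return a
    date only when the input parses cleanly, otherwise None."""
    try:
        return datetime.strptime(value, fmt).date()
    except (ValueError, TypeError):
        # Malformed strings (or NULL inputs) become None instead of
        # raising a transformation error that could drop the row.
        return None
```

For example, "20070201" converts cleanly, while "20070231" (an impossible date) and NULL inputs return None rather than failing the row.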

Decimal Precision
With numeric data columns, developers must determine the expected or required precision of the columns. (By default, to increase performance, PowerCenter treats all numeric columns as 15-digit floating-point decimals, regardless of how they are defined in the transformations. The maximum numeric precision in PowerCenter is 28 digits.) If a column realistically needs a higher precision, check the Enable Decimal Arithmetic option in the Session Properties. However, be aware that enabling this option can slow performance by as much as 15 percent. The Enable Decimal Arithmetic option must also be enabled when comparing two numbers for equality.
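The equality-comparison pitfall can be demonstrated with Python's decimal module, which is analogous to (though not the same as) PowerCenter's decimal arithmetic option:

```python
from decimal import Decimal

# Binary floating point cannot represent 0.1 exactly, so two values
# that are "equal" in business terms may compare unequal.
float_sum = 0.1 + 0.1 + 0.1
assert float_sum != 0.3  # surprising but true with floats

# Fixed-point decimal arithmetic preserves exact values, making
# equality comparisons safe -- the analogue of enabling decimal
# arithmetic when two numbers must be compared for equality.
dec_sum = Decimal("0.1") + Decimal("0.1") + Decimal("0.1")
assert dec_sum == Decimal("0.3")
```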

Trapping Poor Data Quality Techniques


The most important technique for ensuring good data quality is to prevent incorrect, inconsistent, or incomplete data from ever reaching the target system. This goal may be difficult to achieve in a data synchronization or data migration project, but it is very relevant when discussing data warehousing or ODS. This section discusses techniques that you can use to prevent bad data from reaching the system.

Checking Data for Completeness Before Loading


When requesting a data feed from an upstream system, be sure to request an audit file or report that contains a summary of what to expect within the feed. Common requests here are record counts or summaries of numeric data fields. If you have performed a data quality audit, as specified in the Analyze Phase, these metrics and others should be readily available.

Assuming that the metrics can be obtained from the source system, it is advisable to then create a pre-process step that ensures your input source matches the audit file. If the values do not match, stop the overall process from loading into your target system. The source system can then be alerted to verify where the problem exists in its feed.
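A pre-process audit check like the one described can be sketched as follows. The audit-file field names and the numeric field being summed are hypothetical; real feeds would define their own reconciliation fields.

```python
def audit_matches(feed_rows, audit):
    """Compare a feed against its audit summary: the record count and a
    total over a numeric field must both reconcile before loading."""
    actual_count = len(feed_rows)
    actual_total = sum(row["amount"] for row in feed_rows)
    return (actual_count == audit["record_count"]
            and abs(actual_total - audit["amount_total"]) < 0.01)

feed = [{"amount": 100.0}, {"amount": 250.5}]
audit = {"record_count": 2, "amount_total": 350.5}
# Abort the load (and alert the source system) when this returns False.
```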

Enforcing Rules During Mapping


Another method of filtering bad data is to have a set of clearly defined data rules built into the load job. The records are then evaluated against these rules and routed to an Error or Bad table for further re-processing accordingly. An example of this is to check all incoming Country Codes against a Valid Values table. If the code is not found, the record is flagged as an Error record and written to the Error table.

A pitfall of this method is that you must determine what happens to the record once it has been loaded to the Error table. If the record is pushed back to the source system to be fixed, a delay may occur until the record can be successfully loaded to the target system. In fact, if the proper governance is not in place, the source system may refuse to fix the record at all. In this case, a decision must be made to either: 1) fix the data manually and risk not matching the source system; or 2) relax the business rule to allow the record to be loaded.

Oftentimes, in the absence of an enterprise data steward, it is a good idea to assign a team member the role of data steward. It is this person's responsibility to patrol these tables and push back to the appropriate systems as necessary, as well as help to make decisions about fixing or filtering bad data. A data steward should have a good command of the metadata, and he/she should also understand the consequences to the user community of data decisions.
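The country-code rule described above amounts to a router: records failing validation go to an error table. A minimal sketch, with an illustrative rule and stand-in for the Valid Values table:

```python
VALID_COUNTRY_CODES = {"US", "CA", "MX"}  # stand-in for a Valid Values table

def route(records):
    """Split incoming records into loadable rows and error rows,
    tagging each rejected row with the reason for rejection."""
    good, errors = [], []
    for rec in records:
        if rec.get("country_code") in VALID_COUNTRY_CODES:
            good.append(rec)
        else:
            errors.append({**rec, "error": "invalid country_code"})
    return good, errors
```

In PowerCenter this role is typically played by a Router transformation directing failed rows to the error target.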


Another solution applicable in cases with a small number of code values is to try to anticipate any mistyped error codes and translate them back to the correct codes. The cross-reference translation data can be accumulated over time. Each time an error is corrected, both the incorrect and correct values should be put into the table and used to correct future errors automatically.
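The accumulated cross-reference table can be applied as a simple lookup before validation. The mappings shown here are hypothetical examples of previously corrected values:

```python
# Each corrected error feeds this table: incorrect value -> correct value.
CODE_XREF = {"USA": "US", "U.S.": "US", "CAN": "CA"}

def auto_correct(code: str) -> str:
    """Translate a previously-seen mistyped code to its correction;
    unknown codes pass through unchanged for normal error handling."""
    return CODE_XREF.get(code, code)
```

Codes never seen before still flow to the error-table logic; only codes with a known correction are fixed automatically.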

Dimension Not Found While Loading Fact


The majority of current data warehouses are built using a dimensional model. A dimensional model relies on dimension records existing before the fact tables are loaded. This can usually be accomplished by loading the dimension tables before loading the fact tables. However, there are some cases where a corresponding dimension record is not present at the time of the fact load. When this occurs, consistent rules are needed so that data is not improperly exposed to, or hidden from, the users.

One solution is to continue to load the data to the fact table, but assign the foreign key a value that represents Not Found or Not Available in the dimension. These keys must also exist in the dimension tables to satisfy referential integrity, but they provide a clear and easy way to identify records that may need to be reprocessed at a later date.

Another solution is to filter the record from processing since it may no longer be relevant to the fact table. The team will most likely want to flag the row through the use of either error tables or process codes so that it can be reprocessed at a later time.

A third solution is to use dynamic caches and load the dimensions when a record is not found there, even while loading the fact table. This should be done very carefully since it may add unwanted or junk values to the dimension table. One occasion when this may be advisable is in cases where dimensions are simply made up of the distinct combination values in a data set; such a dimension may require a new record when a new combination occurs.

It is imperative that all of these solutions be discussed with the users before making any decisions since they will eventually be the ones making decisions based on the reports.
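The first solution above (assign a reserved "Not Found" key) can be sketched as a lookup with a default surrogate key. The key values and lookup shape are illustrative, not a prescribed design:

```python
NOT_FOUND_KEY = -1  # must also exist as a row in the dimension table

def resolve_fk(natural_key, dim_lookup):
    """Return the dimension surrogate key, or the reserved Not Found
    key so the fact row still loads and can be reprocessed later."""
    return dim_lookup.get(natural_key, NOT_FOUND_KEY)

# Natural key -> surrogate key, as a dimension lookup cache would hold.
dim = {"CUST-100": 1, "CUST-200": 2}
```

Fact rows carrying the reserved key are then easy to find and reprocess once the late-arriving dimension record appears.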

Last updated: 01-Feb-07 18:52


Data Quality Project Estimation and Scheduling Factors

Challenge


This Best Practice is intended to assist project managers who must estimate the time and resources necessary to address data quality issues within data integration or other data-dependent projects. Its primary concerns are the project estimation issues that arise when you add a discrete data quality stage to your data project. However, it also examines the factors that determine when, or whether, you need to build a larger data quality element into your project.

Description
At a high level, there are three ways to add data quality to your project:
- Add a discrete and self-contained data quality stage, such as that enabled by using pre-built Informatica Data Quality (IDQ) processes, or plans, in conjunction with Informatica Data Cleanse and Match.
- Add an expanded but finite set of data quality actions to the project, for example in cases where pre-built plans do not fit the project parameters.
- Incorporate data quality actions throughout the project.

This document should help you decide which of these methods best suits your project and assist in estimating the time and resources needed for the first and second methods.

Using Pre-Built Plans with Informatica Data Cleanse and Match


Informatica Data Cleanse and Match is a cross-application solution that enables PowerCenter users to add data quality processes defined in IDQ to custom transformations in PowerCenter. It incorporates the following components:
- Data Quality Workbench, a user-interface application for building and executing data quality processes, or plans.
- Data Quality Integration, a plug-in component for PowerCenter that integrates PowerCenter and IDQ.
- At least one set of reference data files that can be read by data quality plans to validate and enrich certain types of project data. For example, Data Cleanse and Match can be used with the North America Content Pack, which includes pre-built data quality plans and complete address reference datasets for the United States and Canada.

Data Quality Engagement Scenarios


Data Cleanse and Match delivers its data quality capabilities out of the box; a PowerCenter user can select data quality plans and add them to a Data Quality transformation without leaving PowerCenter. In this way, Data Cleanse and Match capabilities can be added into a project plan as a relatively short and discrete stage.

In a more complex scenario, a Data Quality Developer may wish to modify the underlying data quality plans or create new plans to focus on quality analysis or enhancements in particular areas. This expansion of the data quality operations beyond the pre-built plans can also be handled within a discrete data quality stage.

The Project Manager may decide to implement a more thorough approach to data quality and integrate data quality actions throughout the project plan. In many cases, a convincing case can be made for enlarging the data quality aspect to encompass the full data project. (Velocity contains several tasks and subtasks concerned with such an endeavor.) This is well worth considering: often, businesses do not realize the extent to which their business and project goals depend on the quality of their data.

The project impact of these three types of data quality activity can be summarized as follows:

- Simple stage: 10 days; 1-2 Data Quality Developers.
- Expanded data quality stage: 15-20 days; 2 Data Quality Developers; high visibility to business.
- Data quality integrated with data project: duration of the data project; 2 or more project roles; impact on business and project objectives.

Note: The actual time that should be allotted to the data quality stages noted above depends on the factors discussed in the remainder of this document.

Factors Influencing Project Estimation


The factors influencing project estimation for a data quality stage range from high-level project parameters to lower-level data characteristics. The main factors are listed below and explained in detail later in this document.
- Base and target levels of data quality
- Overall project duration/budget
- Overlap of sources/complexity of data joins
- Quantity of data sources
- Matching requirements
- Data volumes
- Complexity and quantity of data rules
- Geography

Determine which scenario (out-of-the-box Data Cleanse and Match, expanded Data Cleanse and Match, or a thorough data quality integration) best fits your data project by considering the project's overall objectives and its mix of factors.

The Simple Data Quality Stage



Project managers can consider the use of pre-built plans with Data Cleanse and Match as a simple scenario with a predictable number of function points that can be added to the project plan as a single package. You can add the North America Content Pack plans to your project if the project meets most of the following criteria. Similar metrics apply to other types of pre-built plans:

- Baseline functionality of the pre-built data quality plans meets 80 percent of the project needs.
- Complexity of data rules is relatively low.
- Business rules present in pre-built plans need minimal fine-tuning.
- Target data quality level is achievable (i.e., less than 100 percent).
- Quantity of data sources is relatively low.
- Overlap of data sources/complexity of database table joins is relatively low.
- Matching requirements and targets are straightforward.
- Overall project duration is relatively short.
- The project relates to a single country.

Note that the source data quality level is not a major concern.

Implementing the Simple Data Quality Stage


The out-of-the-box scenario is designed to deliver significant increases in data quality in those areas for which the plans were designed (i.e., North American name and address data) in a short time frame. As indicated above, it does not anticipate major changes to the underlying data quality plans. It involves the following three steps:

1. Run pre-built plans.
2. Review plan results.
3. Transfer data to the next stage in the project and (optionally) add data quality plans to PowerCenter transformations.

While every project is different, a single iteration of the simple model may take approximately five days, as indicated below:

- Run pre-built plans (2 days)
- Review plan results (1 day)
- Pass data to the next stage in the project and add plans to PowerCenter transformations (2 days)

Note that these estimates fit neatly into a five-day week but may be conservative in some cases. Note also that a Data Quality Developer can tune plans on an ad-hoc basis to suit the project. Therefore, you should plan for a two-week simple data quality stage:

Week 1:
- Run pre-built plans: 2 days
- Review plan results: 1 day
- Fine-tune pre-built plans if necessary: 2 days

Week 2:
- Re-run pre-built plans and review plan results with stakeholders: 2 days
- Add plans to PowerCenter transformations and define mappings: 1 day
- Run PowerCenter workflows: 1 day
- Review results, obtain approval from stakeholders, and pass all files to the next project stage: 1 day

Expanding the Simple Data Quality Stage


Although the simple scenario above treats the data quality components as a black box, it does allow for modifications to the data quality plans. The types of plan tuning that developers can undertake in this time frame include changing the reference dictionaries used by the plans, editing these dictionaries, and re-selecting the data fields used by the plans as keys to identify data matches. The above time frame does not guarantee that a developer can build or re-build a plan from scratch.

The gap between base and target levels of data quality is an important area to consider when expanding the data quality stage. The Developer and Project Manager may decide to add a data analysis step in this stage, or even decide to split these activities across the project plan by conducting a data quality audit early in the project, so that issues can be revealed to the business in advance of the formal data quality stage. The schedule should allow sufficient time for testing the data quality plans and for contact with the business managers in order to define data quality expectations and targets. In addition:
- If a data quality audit is added early in the project, the data quality stage grows into a project-length endeavor.
- If the data quality audit is included in the discrete data quality stage, the expanded, three-week data quality stage may look like this:

Week 1:
- Set up and run data analysis plans: 1 day
- Review plan results: 1-2 days
- Conduct advance tuning of pre-built plans: 2 days

Week 2:
- Run pre-built plans: 1 day
- Review plan results with stakeholders: 2 days
- Modify pre-built plans or build new plans from scratch: 2 days

Week 3:
- Re-run the plans, review plan results, and obtain approval from stakeholders: 1 day
- Add approved plans to PowerCenter transformations and define mappings: 2 days
- Run PowerCenter workflows: 1 day
- Review results and obtain approval from stakeholders: 1 day
- Approve and pass all files to the next project stage: 1 day

Sizing Your Data Quality Initiatives


The following section describes the factors that affect the estimated time that the data quality endeavors may add to a project. Estimating the specific impact that a single factor is likely to have on a project plan is difficult, as a single data factor rarely exists in isolation from others. If one or two of these factors apply to your data, you may be able to treat them within the scope of a discrete DQ stage. If several factors apply, you are moving into a complex scenario and must design your project plan accordingly.

Base and Target Levels of Data Quality


The rigor of your data quality stage depends in large part on the current (i.e., base) levels of data quality in your dataset and the target levels that you want to achieve. As part of your data project, you should run a set of data analysis plans and determine the strengths and weaknesses of the proposed project data. If your data is already of a high quality relative to project and business goals, then your data quality stage is likely to be a short one. If possible, you should conduct this analysis at an early stage in the data project (i.e., well in advance of the data quality stage). Depending on your overall project parameters, you may have already scoped a Data Quality Audit into your project. However, if your overall project is short in duration, you may have to tailor your data quality analysis actions to the time available.

Action: If there is a wide gap between base and target data quality levels, determine whether a short data quality stage can bridge the gap. If a data quality audit is conducted early in the project, you have latitude to discuss this with the business managers in the context of the overall project timeline. In general, it is good practice to agree with the business to incorporate time into the project plan for a dedicated Data Quality Audit. (See Task 2.8 in the Velocity Work Breakdown Structure.)

If the aggregated data quality percentage for your project's source data is greater than 60 percent, and your target percentage level for the data quality stage is less than 95 percent, then you are in the zone of effectiveness for Data Cleanse and Match.

Note: You can assess data quality according to at least six criteria. Your business may need to improve data quality levels with respect to one criterion but not another. See the Best Practice document Data Cleansing.
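The "zone of effectiveness" test above can be expressed directly. How the aggregate percentage itself is computed is not defined here, so treat this as a sketch of the decision rule only:

```python
def in_cleanse_and_match_zone(base_pct: float, target_pct: float) -> bool:
    """Data Cleanse and Match zone of effectiveness: aggregated base
    data quality above 60 percent and a target below 95 percent."""
    return base_pct > 60 and target_pct < 95
```

A project with 72 percent base quality targeting 90 percent falls inside the zone; one starting below 60 percent, or targeting 95 percent or higher, calls for a more thorough data quality initiative.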

Overall Project Duration/Budget


A data project with a short duration may not have the means to accommodate a complex data quality stage, regardless of the potential or need to enhance the quality of the data involved. In such a case, you may have to incorporate a finite data quality stage. Conversely, a data project with a long timeline may have scope for a larger data quality initiative.

In large data projects with major business and IT targets, good data quality may be a significant issue. For example, poor data quality can affect the ability to cleanly and quickly load data into target systems. Major data projects typically have a genuine need for high-quality data if they are to avoid unforeseen problems.

Action: Evaluate the project schedule parameters and expectations put forward by the business and evaluate how data quality fits into these parameters. You must also determine whether there are any data quality issues that may jeopardize project success, such as a poor understanding of the data structure. These issues may already be visible to the business community; if not, they should be raised with management. Bear in mind that data quality is not simply concerned with the accuracy of the data values; it can encompass the project metadata also.

Overlap of Sources/Complexity of Data Joins


When data sources overlap, data quality issues can be spread across several sources. The relationships among the variables within the sources can be complex, difficult to join together, and difficult to resolve, all adding to project time. If the joins between the data are simple, then this task may be straightforward. However, if the data joins use complex keys or exist over many hierarchies, then the data modeling stage can be time-consuming, and the process of resolving the indices may be prolonged.

Action: You can tackle complexity in data sources and in required database joins within a data quality stage, but in doing so, you step outside the scope of the simple data quality stage.

Quantity of Data Sources


This issue is similar to that of data source overlap and complexity (above). The greater the quantity of sources, the greater the opportunity for data quality issues to arise. The number of data sources has a particular impact on the time required to set up the data quality solution. (The source data setup in PowerCenter can facilitate the data setup in the data quality stage.)

Action: You may find that the number of data sources correlates with the number of data sites covered by the project. If your project includes data from multiple geographies, you step outside the scope of a simple data quality stage.

Matching Requirements
Data matching plans are the most performance-intensive type of data quality plan. Moreover, matching plans are often coupled to a type of data standardization plan (i.e., a grouping plan) that prepares the data for match analysis. Matching plans are not necessarily more complex to design than other types of plans, although they may contain sophisticated business rules. However, the time taken to execute a matching plan grows quadratically, not linearly, with the volume of data records passed through the plan, because every record in a group must be compared against every other record in that group. (Specifically, the time taken is proportional to the size and number of the data groups created in the grouping plans.)

Action: Consult the Best Practice on Effective Data Matching Techniques and determine how long your matching plans may take to run.

Data Volumes
Data matching requirements and data volumes are closely related. As stated above, the time taken to execute a matching plan grows quadratically with the volume of data records passed through it. In other types of plans, this non-linear relationship does not exist. However, the general rule applies: the larger your data volumes, the longer it takes for plans to execute.

Action: Although IDQ can handle data volumes measurable in eight figures, a dataset of more than 1.5 million records is considered larger than average. If your dataset is measurable in millions of records, and high levels of matching/de-duplication are required, consult the Best Practice on Effective Data Matching Techniques.

Complexity and Quantity of Data Rules


This is a key factor in determining the complexity of your data quality stage. If the Data Quality Developer is likely to write a large number of business rules for the data quality plans (as may be the case if data quality target levels are very high or relate to precise data objectives), then the project is de facto moving out of Data Cleanse and Match capability, and you need to add rule-creation and rule-review elements to the data quality effort.

Action: If the business requires multiple complex rules, you must scope additional time for rule creation and for multiple iterations of the data quality stage. Bear in mind that, as well as being written and added to data quality plans, the rules must be tested and approved by the business.

Geography
Geography affects the project plan in two ways:

- First, the geographical spread of data sites is likely to affect the time needed to run plans, collate data, and engage with key business personnel. Working hours in different time zones can mean that one site is starting its business day while others are ending theirs, and this can affect the tight scheduling of the simple data quality stage.
- Second, project data that is sourced from several countries typically means multiple data sources, with opportunities for data quality issues to arise that may be specific to the country or the division of the organization providing the data source.

There is also a high correlation between the scale of the data project and the scale of the enterprise in which the project will take place. For multi-national corporations, there is rarely such a thing as a small data project!

Action: Consider the geographical spread of your source data. If the data sites are spread across several time zones or countries, you may need to factor time lags into your data quality planning.


Developing the Data Quality Business Case

Challenge


When a potential data quality issue has been identified, it is imperative to develop a business case that details the severity of the issue along with the benefits to be gained by implementing a data quality strategy. A strong business case can help to build the necessary organizational support for funding a data quality initiative.

Description
Building a business case around data quality often necessitates starting with a pilot project. The purpose of the pilot project is to document the anticipated return on investment (ROI). It is important to ensure that the pilot is both manageable and achievable in a relatively short period of time. Build the business case by conducting a Data Quality Audit on a representative sample set of data, but set a reasonable scope so that the audit can be accomplished within a three- to four-week period.

At the conclusion of the Data Quality Audit, a report should be prepared that captures the results of the investigation (i.e., invalid data, duplicate records, etc.) and extrapolates the expected cost savings that can be gained if an enterprise data quality initiative is pursued.

Below are the five key steps necessary to develop a business case for a Data Quality Audit. Following these steps also provides a solid foundation for detailing the business requirements for an enterprise data quality initiative.

1. Identify a Test Source

a. What source file(s) are to be considered?

A representative sample set of data should be evaluated. This can be a cross-section of an enterprise data set or data from a specific department in which a potential data quality issue is expected to be found.

b. What data within those files (priority, obsolete, dormant, incorrect) will be used?


Prior to conducting the Data Quality Audit, the type of data within each file should be documented. The results generated during the Audit should be tracked against the anticipated data types. For example, if 10% of the records are incorrectly flagged as priority (when they should be marked obsolete or dormant), any reporting based upon this data will be skewed.

2. Identify Issues

a. What data needs to be fixed?

Any anticipated issues with the data should be identified prior to conducting the Audit in order to ensure that the specific use cases are investigated.

b. What data needs to be changed or enhanced?

A data dictionary should be created or made available to capture any anticipated values that should reside within a given data field. These values are used via a reference lookup to analyze the level of conformity between the actual value and the recorded value in the reference dictionary. Additionally, any missing values should be updated based upon the documented data dictionary value.

c. What is a representative set of business rules to demonstrate functionality?

Prior to conducting the Audit, a discussion should be held regarding the business rules that should be enforced in the provided data set. The intent is to use the expected business rules as a starting point for validation of the data during the Audit. As new rules are likely to be identified during the Audit, having a starting point ensures that initial results can be quickly disseminated to key stakeholders via an initial data quality iteration that leverages the previously documented business rules.

3. Define Scope

a. What can be achieved with which resources in the time available?

The scope of the Audit should be defined in order to ensure that a business case can be made for a data quality initiative within weeks, not months. The project should be seen as a pilot in order to validate the anticipated ROI if an enterprise initiative is pursued. Just as the scope should be well defined, commitments should be agreed upon prior to starting the project that the required resources (i.e., data steward, IT representative, business user) will be available as needed for the duration of the project. This ensures that activities such as the data and business rule reviews remain on schedule.

b. What milestones are critical to other parts of the project?

Any relationships between the outcome of the project and other initiatives within the organization should be identified up front. Although the Audit is a pilot project, the data quality results should be reusable on other projects within the organization. If there are specific milestones for the delivery of results, these should be incorporated into the project plan to ensure that other projects are not adversely impacted.

4. Highlight Resulting Issues

a. Highlight typical issues for the business, data owners, the governance team, and senior management.

Upon conclusion of the Audit, the issues uncovered during the project should be summarized and presented to key stakeholders in a workshop setting. During the workshop, the results should be highlighted, along with any anticipated impact to the business if a data quality initiative is not enacted within the organization.

b. Test the execution and resolution of issues.

During the Audit, identified issues should be resolved by leveraging Informatica Data Quality. During the workshop, the means used to resolve the issues and the end results should be presented. The types of issues typically resolved include address validation, ensuring conformity of data through the use of reference dictionaries, and the identification and resolution of duplicate data.

5. Build Knowledge

a. Gain confidence and knowledge of data quality management strategies, conference room pilots, migrations, etc.

To reiterate, the intent of the Audit is to quantify the anticipated ROI within an organization if a data quality strategy is implemented. Additionally, knowledge about the data, the business rules, and the potential strategy that can be leveraged throughout the entire organization should be captured.

b. The rules employed will form the basis of an ongoing DQM strategy in the target systems.

The identified rules should be incorporated into an existing data quality management strategy or utilized as the starting point for a new strategy moving forward.

The above steps are intended as a starting point for developing a framework for conducting a Data Quality Audit. From this Audit, the key stakeholders in an organization should have definitive proof of the extent and types of data quality issues within their organization, and of the anticipated ROI that can be achieved through the introduction of data quality throughout the organization.

Last updated: 21-Aug-07 11:48


Effective Data Matching Techniques

Challenge


Identifying and eliminating duplicates is a cornerstone of effective marketing efforts and customer resource management initiatives, and it is an increasingly important driver of cost-efficient compliance with regulatory initiatives such as KYC (Know Your Customer). Once duplicate records are identified, you can remove them from your dataset and better recognize key relationships among data records (such as customer records from a common household). You can also match records or values against reference data to ensure data accuracy and validity.

This Best Practice is targeted toward Informatica Data Quality (IDQ) users familiar with Informatica's matching approach. It has two high-level objectives:

- To identify the key performance variables that affect the design and execution of IDQ matching plans.
- To describe plan design and plan execution actions that will optimize plan performance and results.

To optimize your data matching operations in IDQ, you must be aware of the factors that are discussed below.

Description
All too often, an organization's datasets contain duplicate data in spite of numerous attempts to cleanse the data or prevent duplicates from occurring. In other scenarios, the datasets may lack common keys (such as customer numbers or product ID fields) that, if present, would allow clear joins between the datasets and improve business knowledge.

Identifying and eliminating duplicates in datasets can serve several purposes. It enables the creation of a single view of customers; it can help control costs associated with mailing lists by preventing multiple pieces of mail from being sent to the same person or household; and it can assist marketing efforts by identifying households or individuals who are heavy users of a product or service. Data can be enriched by matching across production data and reference data sources. Business intelligence operations can be improved by identifying links between two or more systems to provide a more complete picture of how customers interact with a business.

IDQ's matching capabilities can help to resolve dataset duplications and deliver business results. However, a user's ability to design and execute a matching plan that meets the key requirements of performance and match quality depends on understanding the best-practice approaches described in this document. An integrated approach to data matching involves several steps that prepare the data for matching and improve the overall quality of the matches. The following table outlines the processes in each step.

Step | Description
Profiling | Typically the first stage of the data quality process, profiling generates a picture of the data and indicates the data elements that can comprise effective group keys. It also highlights the data elements that require standardizing to improve match scores.
Standardization | Removes noise, excess punctuation, variant spellings, and other extraneous data elements. Standardization reduces the likelihood that match quality will be affected by data elements that are not relevant to match determination.
Grouping | A post-standardization function in which the group key fields identified in the profiling stage are used to segment data into logical groups that facilitate matching plan performance.
Matching | The process whereby the data values in the created groups are compared against one another and record matches are identified according to user-defined criteria.
Consolidation | The process whereby duplicate records are cleansed. It identifies the master record in a duplicate cluster and permits the creation of a new dataset or the elimination of subordinate records. Any child data associated with subordinate records is linked to the master record.
The sections below identify the key factors that affect the performance (or speed) of a matching plan and the quality of the matches identified. They also outline the best practices that ensure that each matching plan is implemented with the highest probability of success. (This document does not make any recommendations on profiling, standardization or consolidation strategies. Its focus is grouping and matching.) The following table identifies the key variables that affect matching plan performance and the quality of matches identified.

Factor | Impact | Impact summary
Group size | Plan performance | The number and size of groups have a significant impact on plan execution speed.
Group keys | Quality of matches | The proper selection of group keys ensures that the maximum number of possible matches is identified in the plan.
Hardware resources | Plan performance | Processors, disk performance, and memory require consideration.
Size of dataset(s) | Plan performance | This is not a high-priority issue, but it should be considered when designing the plan.
Informatica Data Quality components | Plan performance | The plan designer must weigh file-based versus database matching approaches when considering plan requirements.
Time window and frequency of execution | Plan performance | The time taken for a matching plan to complete execution depends on its scale. Timing requirements must be understood up front.
Match identification | Quality of matches | The plan designer must weigh deterministic versus probabilistic approaches.

Group Size
Grouping breaks large datasets down into smaller ones to reduce the number of record-to-record comparisons performed in the plan, which directly impacts the speed of plan execution. When matching on grouped data, a matching plan compares the records within each group with one another. When grouping is implemented properly, plan execution speed increases significantly, with no meaningful effect on match quality.

The most important determinant of plan execution speed is the size of the groups to be processed, that is, the number of data records in each group. For example, consider a dataset of 1,000,000 records for which a grouping strategy generates 10,000 groups. If 9,999 of these groups have an average of 50 records each, the remaining group will contain more than 500,000 records; based on this one large group alone, the matching plan would require 87 days to complete at a rate of 1,000,000 comparisons a minute. In comparison, the remaining 9,999 groups could be matched in about 12 minutes if the group sizes were evenly distributed.

Group size can also affect the quality of the matches returned in the matching plan. Large groups perform more record comparisons, so more likely matches are potentially identified. The reverse is true for small groups: as groups get smaller, fewer comparisons are possible, and the potential for missing good matches increases.

The goal of grouping is to optimize performance while minimizing the possibility that valid matches will be overlooked because like records are assigned to different groups. Therefore, groups must be defined intelligently through the use of group keys.
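The arithmetic behind this example can be sketched in a few lines. This is an illustrative calculation, not IDQ functionality; the rate of 1,000,000 comparisons per minute is the average throughput cited later in this document.

```python
# Pairwise comparisons within a group grow with the square of group size:
# n * (n - 1) / 2. A throughput of 1,000,000 comparisons/minute is assumed.

def comparisons(group_size: int) -> int:
    """Record-to-record comparisons needed within one group."""
    return group_size * (group_size - 1) // 2

def runtime_minutes(group_sizes, rate_per_minute=1_000_000) -> float:
    """Estimated matching time for a set of groups."""
    return sum(comparisons(n) for n in group_sizes) / rate_per_minute

# One 500,000-record group vs. 9,999 groups of ~50 records each.
big = runtime_minutes([500_000])
small = runtime_minutes([50] * 9_999)
print(f"{big / (60 * 24):.0f} days")   # 87 days for the single large group
print(f"{small:.0f} minutes")          # 12 minutes for the 9,999 small groups
```

The quadratic growth of the per-group comparison count is why group size, rather than total dataset size, dominates matching runtime.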

Group Keys
Group keys determine which records are assigned to which groups. Group key selection, therefore, has a significant effect on the success of matching operations. Grouping splits data into logical chunks and thereby reduces the total number of comparisons performed by the plan. The selection of group keys, based on key data fields, is critical to ensuring that relevant records are compared against one another. When selecting a group key, two main criteria apply:

- Candidate group keys should represent a logical separation of the data into distinct units where there is a low probability that matches exist between records in different units. This can be determined by profiling the data and uncovering the structure and quality of the content prior to grouping.
- Candidate group keys should also have high scores in three key areas of data quality: completeness, conformity, and accuracy. Problems in these areas can be improved by standardizing the data prior to grouping.

For example, geography is a logical separation criterion when comparing name and address data. A record for a person living in Canada is unlikely to match someone living in Ireland. Thus, the country-identifier field can provide a useful group key. However, if you are working with national data (e.g., Swiss data), duplicate data may exist for an individual living in Geneva, who may also be recorded as living in Genf or Geneve. If the group key in this case is based on city name, records for Geneva, Genf, and Geneve will be written to different groups and never compared unless variant city names are standardized.
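The Geneva example above can be sketched as a standardize-then-group step. The `CITY_VARIANTS` lookup and the record layout below are hypothetical illustrations, not IDQ reference dictionaries or APIs.

```python
from collections import defaultdict

# Hypothetical reference lookup mapping variant city spellings to one form.
CITY_VARIANTS = {"genf": "geneva", "geneve": "geneva"}

def group_key(record: dict) -> str:
    """Standardize the grouping field so variant spellings share one key."""
    city = record["city"].strip().lower()
    return CITY_VARIANTS.get(city, city)

def build_groups(records):
    """Assign each record to a group; only records sharing a key are compared."""
    groups = defaultdict(list)
    for rec in records:
        groups[group_key(rec)].append(rec)
    return groups

records = [
    {"name": "A. Dupont", "city": "Geneva"},
    {"name": "A Dupont", "city": "Genf"},
    {"name": "B. Byrne", "city": "Dublin"},
]
groups = build_groups(records)
print(sorted(groups))           # ['dublin', 'geneva']
print(len(groups["geneva"]))    # 2 -- both Dupont records can now be compared
```

Without the standardization step, the two Dupont records would land in separate groups and never be compared.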

Size of Dataset
In matching, the size of the dataset typically does not have as significant an impact on plan performance as the definition of the groups within the plan. However, in general terms, the larger the dataset, the more time required to produce a matching plan, both in terms of preparing the data and executing the plan.

IDQ Components
All IDQ components serve specific purposes, and very little functionality is duplicated across the components. However, there are performance implications for certain component types, combinations of components, and the quantity of components used in a plan.

Several tests were conducted on IDQ (version 2.11) to compare source/sink combinations and various operational components. In tests comparing file-based matching against database matching, file-based matching outperformed database matching in UNIX and Windows environments for plans containing up to 100,000 groups. Also, matching plans that wrote output to a CSV Sink outperformed plans with a DB Sink or Match Key Sink. Plans with a Mixed Field Matcher component performed more slowly than plans without one.

Raw performance should not be the only consideration when selecting the components to use in a matching plan. Different components serve different needs and may offer advantages in a given scenario.

Time Window
IDQ can perform millions or billions of comparison operations in a single matching plan. The time available for completing a matching plan can have a significant impact on whether the plan is perceived to be running correctly. Knowing the time window for plan completion helps to determine the hardware configuration choices, grouping strategy, and the IDQ components to employ.

Frequency of Execution
The frequency with which plans are executed is linked to the time window available. Matching plans may need to be tuned to fit within the cycle in which they are run. The more frequently a matching plan is run, the more the execution time will have to be considered.

Match Identification
The method used by IDQ to identify good matches has a significant effect on the success of the plan. Two key methods for assessing matches are:

- deterministic matching
- probabilistic matching

Deterministic matching applies a series of checks to determine if a match can be found between two records. IDQ's fuzzy matching algorithms can be combined with this method. For example, a deterministic check may first test whether the last-name comparison score is greater than 85 percent. If this is true, it next checks the address. If an 80 percent match is found, it then checks the first name. If a 90 percent match is found on the first name, the entire record is considered successfully matched.

The advantages of deterministic matching are: (1) it follows a logical path that can be easily communicated to others, and (2) it is similar to the methods employed when manually checking for matches. The disadvantages of this method are its rigidity and its requirement that each dependency be true. This can result in matches being missed, or can require several different rule checks to cover all likely combinations.

Probabilistic matching takes the match scores from fuzzy matching components and assigns weights to them in order to calculate a weighted average that indicates the degree of similarity between two pieces of information. The advantage of probabilistic matching is that it is less rigid than deterministic matching. There are no dependencies on certain data elements matching in order for a full match to be found. Weights assigned to individual components can place emphasis on different fields or areas in a record. However, even if a heavily-weighted score falls below a defined threshold, match scores from less heavily-weighted components may still produce a match.

The disadvantages of this method are the higher degree of tweaking required on the user's part to get the right balance of weights and optimize successful matches. This can be difficult for users to understand and communicate to one another. Also, the cut-off mark for good matches versus bad matches can be difficult to assess. For example, a matching plan score of 95 to 100 percent may correspond entirely to good matches, but plan scores between 90 and 94 percent may map to only 85 percent genuine matches. Matches between 85 and 89 percent may correspond to only 65 percent genuine matches, and so on.
The following table illustrates this principle:

Plan match score | Records that are genuine matches
95 to 100 percent | 100 percent
90 to 94 percent | 85 percent
85 to 89 percent | 65 percent

Close analysis of the match results is required because of this relationship between match quality and the match threshold scores assigned: there may not be a one-to-one mapping between the plan's weighted score and the number of records that can be considered genuine matches.
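The contrast between the two decision methods can be sketched as follows. The thresholds, field weights, and score values are illustrative assumptions, not IDQ defaults; in a real plan the similarity scores would come from fuzzy-matching components.

```python
def deterministic_match(last: float, address: float, first: float) -> bool:
    """Cascade of checks: every dependency must pass, in order."""
    return last > 0.85 and address > 0.80 and first > 0.90

def probabilistic_match(last: float, address: float, first: float,
                        threshold: float = 0.85) -> bool:
    """Weighted average of field scores compared against a single cut-off."""
    score = 0.5 * last + 0.3 * address + 0.2 * first
    return score >= threshold

# Strong surname and address scores with a weak first name: the rigid
# cascade rejects the pair, while the weighted score still accepts it.
print(deterministic_match(0.98, 0.95, 0.60))  # False
print(probabilistic_match(0.98, 0.95, 0.60))  # True (score 0.895 >= 0.85)
```

The example shows the trade-off described above: the cascade is easy to explain but misses this likely match, while the weighted approach finds it at the cost of a threshold that must be tuned.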

Best Practice Operations


The following section outlines best practices for matching with IDQ.

Capturing Client Requirements


Capturing client requirements is key to understanding how successful and relevant your matching plans are likely to be. As a best practice, be sure to answer the following questions, as a minimum, before designing and implementing a matching plan:
- How large is the dataset to be matched?
- How often will the matching plans be executed?
- When will the match process need to be completed?
- Are there any other dependent processes?
- What are the rules for determining a match?
- What process is required to sign off on the quality of match results?
- What processes exist for merging records?

Test Results
Performance tests demonstrate the following:
- IDQ has near-linear scalability in a multi-processor environment.
- Scalability in standard installations, as achieved in the allocation of matching plans to multiple processors, will eventually level off.

Performance is the key to success in high-volume matching solutions. IDQs architecture supports massive scalability by allowing large jobs to be subdivided and executed across several processors. This scalability greatly enhances IDQs ability to meet the service levels required by users without sacrificing quality or requiring an overly complex solution. If IDQ is integrated with PowerCenter, matching scalability can be achieved using PowerCenter's partitioning capabilities.

Managing Group Sizes


As stated earlier, group sizes have a significant effect on the speed of matching plan execution. Also, the quantity of small groups should be minimized to ensure that the greatest number of comparisons is captured. Keep the following parameters in mind when designing a grouping plan.

Condition | Best practice | Exceptions
Maximum group size | 5,000 records | Large datasets over 2M records with uniform data. Minimize the number of groups containing more than 5,000 records.
Minimize the number of single-record groups | 1,000 groups per one million record dataset |
Optimum number of comparisons | 500,000,000 comparisons +/- 20 percent per 1 million records |

In cases where the datasets are large, multiple group keys may be required to segment the data to ensure that best practice guidelines are followed. Informatica Corporation can provide sample grouping plans that automate these requirements as far as is practicable.
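The rules of thumb in the table above can be checked programmatically when evaluating a candidate grouping strategy. The function below is an illustrative sketch (not part of IDQ or the sample plans mentioned); the limits are taken directly from the best-practice table.

```python
def audit_groups(group_sizes, total_records):
    """Check a grouping strategy against the best-practice limits above."""
    oversized = sum(1 for n in group_sizes if n > 5_000)
    singles = sum(1 for n in group_sizes if n == 1)
    pair_comparisons = sum(n * (n - 1) // 2 for n in group_sizes)
    per_million = pair_comparisons / (total_records / 1_000_000)
    return {
        "oversized_groups": oversized,          # target: minimize
        "single_record_groups": singles,        # target: ~1,000 per 1M records
        "comparisons_per_million": per_million, # target: 500M +/- 20 percent
        "within_comparison_target": 400_000_000 <= per_million <= 600_000_000,
    }

stats = audit_groups([5_001, 1, 1, 50], total_records=1_000_000)
print(stats["oversized_groups"], stats["single_record_groups"])  # 1 2
```

A grouping strategy that fails these checks usually calls for different or additional group keys before the matching plan is run.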

Group Key Identification


Identifying appropriate group keys is essential to the success of a matching plan. Ideally, any dataset that is about to be matched has been profiled and standardized to identify candidate keys. Group keys act as a first pass or high-level summary of the shape of the dataset(s). Remember that only data records within a given group are compared with one another. Therefore, it is vital to select group keys that have high data quality scores for completeness, conformity, consistency, and accuracy. Group key selection depends on the type of data in the dataset, for example whether it contains name and address data or other data types such as product codes.

Hardware Specifications
Matching is a resource-intensive operation, especially in terms of processor capability. Three key variables determine the effect of hardware on a matching plan: processor speed, disk performance, and memory.

The majority of the activity required in matching is tied to the processor, so processor speed has a significant effect on how fast a matching plan completes. Although the average computational speed for IDQ is one million comparisons per minute, the speed can range from as low as 250,000 to as high as 6.5 million comparisons per minute, depending on the hardware specification, background processes running, and components used. As a best practice, higher-specification processors (e.g., 1.5 GHz minimum) should be used for high-volume matching plans.

Hard disk capacity and available memory can also determine how fast a plan completes. The hard disk reads and writes data required by IDQ sources and sinks. The speed of the disk and the level of defragmentation affect how quickly data can be read from, and written to, the hard disk. Information that cannot be stored in memory during plan execution must be temporarily written to the hard disk. This increases the time required to retrieve information that otherwise could be stored in memory, and also increases the load on the hard disk. A RAID drive may be appropriate for datasets of 3 to 4 million records, and a minimum of 512MB of memory should be available.

The following table is a rough guide for hardware estimates based on IDQ Runtime on Windows platforms. Specifications for UNIX-based systems vary.

Match volumes | Suggested hardware specification
< 1,500,000 records | 1.5 GHz computer, 512MB RAM
1,500,000 to 3 million records | Multi-processor server, 1GB RAM
> 3 million records | Multi-processor server, 2GB RAM, RAID 5 hard disk

Single Processor vs. Multi-Processor


With IDQ Runtime, it is possible to run multiple processes in parallel. Matching plans, whether they are file-based or database-based, can be split into multiple plans to take advantage of multiple processors on a server. Be aware, however, that this requires additional effort to create the groups and consolidate the match output. Also, matching plans split across four processors do not run four times faster than a single-processor matching plan. As a result, multi-processor matching may not significantly improve performance in every case.

Using IDQ with PowerCenter and taking advantage of PowerCenter's partitioning capabilities may also improve throughput. This approach has the advantage that splitting plans into multiple independent plans is typically not required. The following table can help in estimating the relative execution times of single- and multi-processor match plans.

Plan type | Single processor | Multi-processor
Standardization/grouping | Depends on operations and size of dataset (time equals Y) | Single-processor time plus 20 percent (time equals Y * 1.20)
Matching | Est. 1 million comparisons a minute (time equals X) | Single-processor matching time divided by the number of processors (NP), plus 25 percent (time equals (X / NP) * 1.25)
For example, if a single-processor plan takes one hour to group and standardize the data and eight hours to match, a four-processor match plan should require approximately one hour and 12 minutes to group and standardize (1 * 1.20 hours) and two and a half hours to match ((8 / 4) * 1.25 hours). The time difference between the single- and multi-processor plans in this case would be more than five hours (i.e., nine hours for the single-processor plan versus three hours and 42 minutes for the quad-processor plan).
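The estimation rule in the table can be captured in a small helper. This is a sketch of the document's rule of thumb (20 percent grouping overhead, 25 percent matching overhead on multi-processor runs), not a measured performance model.

```python
def estimate_hours(group_hours: float, match_hours: float, processors: int) -> float:
    """Estimated total runtime under the single/multi-processor rule of thumb."""
    if processors <= 1:
        return group_hours + match_hours
    # Grouping/standardization does not parallelize (add 20% overhead);
    # matching divides across processors with ~25% coordination overhead.
    return group_hours * 1.20 + (match_hours / processors) * 1.25

single = estimate_hours(1.0, 8.0, processors=1)
quad = estimate_hours(1.0, 8.0, processors=4)
print(f"{single:.1f}h single vs {quad:.1f}h on four processors")
# -> 9.0h single vs 3.7h on four processors
```

Running the worked example through the helper reproduces the figures above: 9.0 hours on one processor against roughly 3.7 hours on four.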

Deterministic vs. Probabilistic Comparisons


No best-practice research has yet been completed on which type of comparison is most effective at determining a match. Each method has strengths and weaknesses. A 2006 article by Forrester Research stated a preference for deterministic comparisons since they remove the burden of identifying a universal match threshold from the user. Bear in mind that IDQ supports deterministic matching operations only. However, IDQ's Weight Based Analyzer component lets plan designers calculate weighted match scores for matched fields.
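The weighted-score idea can be illustrated with a small sketch. The field names and weights below are hypothetical examples, not IDQ configuration values.

```python
def weighted_match_score(field_scores, weights):
    """Combine per-field match scores (0.0 to 1.0) into one weighted score.

    Fields that matter more for identity (e.g., surname) get higher weights.
    """
    total_weight = sum(weights.values())
    return sum(field_scores[f] * w for f, w in weights.items()) / total_weight

scores = {"surname": 0.9, "address": 0.6, "zip": 1.0}   # per-field similarities
weights = {"surname": 3, "address": 2, "zip": 1}        # hypothetical weights
score = weighted_match_score(scores, weights)           # (2.7 + 1.2 + 1.0) / 6
print(round(score, 3))                                  # 0.817
```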

Database vs. File-Based Matching


File-based matching and database matching perform essentially the same operations. The major differences between the two methods revolve around how data is stored and how the outputs can be manipulated after matching is complete. With regard to selecting one method or the other, there are no best practice recommendations, since the choice is largely defined by requirements. The following table outlines the strengths and weaknesses of each method:

                                      File-Based Method                                        Database Method
Ease of implementation                Easy to implement                                        Requires SQL knowledge
Performance                           Fastest method                                           Slower than file-based method
Space utilization                     Requires more hard-disk space                            Lower hard-disk space requirement
Operating system restrictions         Possible limit to number of groups that can be created   None
Ability to control/manipulate output  Low                                                      High

High-Volume Data Matching Techniques


This section discusses the challenges facing IDQ matching plan designers in optimizing their plans for speed of execution and quality of results. It highlights the key factors affecting matching performance and discusses the results of IDQ performance testing in single- and multi-processor environments.

Checking for duplicate records where no clear connection exists among data elements is a resource-intensive activity. In order to detect matching information, a record must be compared against every other record in a dataset. For a single data source, the quantity of comparisons required to check an entire dataset increases quadratically as the volume of data increases. A similar situation arises when matching between two datasets, where the number of comparisons required is the product of the volumes of data in each dataset. When the volume of data increases into the tens of millions, the number of comparisons required to identify matches, and consequently the amount of time required to check for matches, reaches impractical levels.
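The scale of the problem is easy to quantify: matching n records against each other requires n(n-1)/2 comparisons. A quick sketch (the group sizes are invented for illustration):

```python
def pair_comparisons(n):
    """Pairwise comparisons needed to match n records against each other."""
    return n * (n - 1) // 2

print(pair_comparisons(1_000_000))     # 499999500000 comparisons ungrouped
# With 10,000 groups of 100 records each (the same 1,000,000 records),
# comparisons are only needed within each group:
print(10_000 * pair_comparisons(100))  # 49500000 comparisons
```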

Approaches to High-Volume Matching


Two key factors control the time it takes to match a dataset:
- The number of comparisons required to check the data.
- The number of comparisons that can be performed per minute.

The first factor can be controlled in IDQ through grouping, which involves logically segmenting the dataset into distinct elements, or groups, so that there is a high probability that records within a group are not duplicates of records outside of the group. Grouping data greatly reduces the total number of required comparisons without affecting match accuracy. IDQ affects the number of comparisons per minute in two ways:
- Its matching components maximize the comparison activities assigned to the computer processor. This reduces the amount of disk I/O communication in the system and increases the number of comparisons per minute. Therefore, hardware with higher processor speeds has higher match throughputs.
- IDQ architecture also allows matching tasks to be broken into smaller tasks and shared across multiple processors. The use of multiple processors to handle matching operations greatly enhances IDQ scalability with regard to high-volume matching problems.

The following section outlines how a multi-processor matching solution can be implemented and illustrates the results obtained in Informatica Corporation testing.

Multi-Processor Matching: Solution Overview


IDQ does not automatically distribute its load across multiple processors. To scale a matching plan to take advantage of a multi-processor environment, the plan designer must develop multiple plans for execution in parallel. To develop this solution, the plan designer first groups the data to prevent the plan from running low-probability comparisons. Groups are then subdivided into one or more subgroups (the number of subgroups depends on the plan being run and the number of processors in use). Each subgroup is assigned to a discrete matching plan, and

the plans are executed in parallel. The following diagram outlines how multi-processor matching can be implemented in a database model. Source data is first grouped and then subgrouped according to the number of processors available to the job. Each subgroup of data is loaded into a separate staging area, and the discrete match plans are run in parallel against each table. Results from each plan are consolidated to generate a single match result for the original source data.
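One way to distribute grouped data across parallel match plans can be sketched as follows. The group key and the greedy load-balancing heuristic are illustrative assumptions, not IDQ's actual subgrouping mechanism; this sketch keeps whole groups together so that no within-group comparisons are lost.

```python
from collections import defaultdict

def assign_groups_to_plans(records, group_key, num_plans):
    """Group records, then assign whole groups to parallel match plans,
    balancing by estimated comparison load (n*(n-1)/2 per group)."""
    groups = defaultdict(list)
    for rec in records:
        groups[group_key(rec)].append(rec)

    plans = [[] for _ in range(num_plans)]
    loads = [0] * num_plans
    # Greedy heuristic: place the largest remaining group on the lightest plan.
    for grp in sorted(groups.values(), key=len, reverse=True):
        i = loads.index(min(loads))
        plans[i].extend(grp)
        loads[i] += len(grp) * (len(grp) - 1) // 2
    return plans

# Hypothetical example: group customer records by surname initial.
records = [{"surname": s} for s in
           ["Smith", "Smyth", "Jones", "Jonas", "Brown", "Braun"]]
plans = assign_groups_to_plans(records, lambda r: r["surname"][0], 2)
```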

Informatica Corporation Match Plan Tests


Informatica Corporation performed match plan tests on a 2GHz Intel Xeon dual-processor server running Windows 2003 (Server edition). Two gigabytes of RAM were available. The hyper-threading ability of the Xeon processors effectively provided four CPUs on which to run the tests. Several tests were performed using file-based and database-based matching methods and single- and multiple-processor methods. The tests were performed on one million rows of data. Grouping of the data limited the total number of comparisons to approximately 500,000,000.

Test results using file-based and database-based methods showed near-linear scalability as the number of available processors increased. As the number of processors increased, so too did the demand on disk I/O resources. As the processor capacity began to scale upward, disk I/O in this configuration eventually limited the benefits of adding additional processor capacity. This is demonstrated in the graph below.


Execution times for multiple processors were based on the longest execution time of the jobs run in parallel. Therefore, having an even distribution of records across all processors was important to maintaining scalability. When the data was not evenly distributed, some match plans ran longer than others, and the benefits of scaling over multiple processors were not as evident.
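Because a parallel run finishes only when the slowest plan finishes, skew directly inflates wall-clock time. A toy calculation with made-up timings:

```python
def parallel_runtime(plan_hours):
    """Wall-clock time of plans run in parallel: the slowest plan's time."""
    return max(plan_hours)

even = [2.5, 2.5, 2.5, 2.5]    # 10 hours of matching, evenly split
skewed = [0.5, 1.0, 1.5, 7.0]  # the same 10 hours, badly skewed
print(parallel_runtime(even))    # 2.5
print(parallel_runtime(skewed))  # 7.0
```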

Last updated: 26-May-08 17:52


Effective Data Standardizing Techniques

Challenge


To enable users to streamline their data cleansing and standardization processes (or plans) with Informatica Data Quality (IDQ). The intent is to shorten development timelines and ensure a consistent and methodological approach to cleansing and standardizing project data.

Description
Data cleansing refers to operations that remove non-relevant information and noise from the content of the data. Examples of cleansing operations include the removal of person names, care-of information, excess character spaces, or punctuation from postal addresses. Data standardization refers to operations that modify the appearance of the data so that it takes on a more uniform structure, and that enrich the data by deriving additional details from existing content.

Cleansing and Standardization Operations


Data can be transformed into a standard format appropriate for its business type. This is typically performed on complex data types such as name and address or product data. A data standardization operation typically profiles data by type (e.g., word, number, code) and parses data strings into discrete components. This reveals the content of the elements within the data as well as standardizing the data itself. For best results, the Data Quality Developer should carry out these steps in consultation with a member of the business. Often, this individual is the data steward, the person who best understands the nature of the data within the business scenario.
- Within IDQ, the Profile Standardizer is a powerful tool for parsing unsorted data into the correct fields. However, when using the Profile Standardizer, be aware that there is a finite number of profiles (500) that can be contained within a cleansing plan. Users can extend the number of profiles by using the first 500 profiles within one component and then feeding the data overflow into a second Profile Standardizer via the Token Parser component.

After the data is parsed and labeled, it should be evident if reference dictionaries will be needed to further standardize the data. It may take several iterations of dictionary construction and review before the data is standardized to an acceptable level. Once acceptable standardization has been achieved, data quality scorecard or dashboard reporting can be introduced. For information on dashboard reporting, see the Report Viewer chapter of the Informatica Data Quality 3.1 User Guide.


Discovering Business Rules


At this point, the business user may discover and define business rules applicable to the data. These rules should be documented and converted to logic that can be contained within a data quality plan. When building a data quality plan, be sure to group related business rules together in a single rules component whenever possible; otherwise the plan may become very difficult to read. If there are rules that do not lend themselves easily to regular IDQ components (e.g., when standardizing product data information), it may be necessary to perform some custom scripting using IDQ's Scripting component. This requirement may arise when a string or an element within a string needs to be treated as an array.

Standard and Third-Party Reference Data


Reference data can be a useful tool when standardizing data. Terms with variant formats or spellings can be standardized to a single form. IDQ installs with several reference dictionary files that cover common name and address and business terms. The illustration below shows part of a dictionary of street address suffixes.

Common Issues when Cleansing and Standardizing Data


If the customer has expectations of a bureau-style service, it may be advisable to re-emphasize the scorecarding and graded-data approach to cleansing and standardizing. This helps to ensure that the customer develops reasonable expectations of what can be achieved with the data set within an agreed-upon timeframe.


Standardizing Ambiguous Data


Data values can often appear ambiguous, particularly in name and address data, where name, address, and premise values can be interchangeable. For example, Hill, Park, and Church are all common surnames. In some cases, the position of the value is important: ST can be a suffix for Street or a prefix for Saint, and sometimes both can occur in the same string. The address string St Patrick's Church, Main St can reasonably be interpreted as Saint Patrick's Church, Main Street. In this case, if the delimiter is a space (thus ignoring any commas and periods), the string has five tokens. You may need to write business rules using the IDQ Scripting component, as you are treating the string as an array: St at position 1 within the string would be standardized to meaning_1, whereas St at position 5 would be standardized to meaning_2. Each data value can then be compared to a discrete prefix and suffix dictionary.
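The position-based rule described above can be sketched as follows. This is a simplified Python illustration of the logic, not IDQ Scripting component syntax.

```python
def standardize_st(address):
    """Standardize the ambiguous token 'St' by position:
    a leading 'St' becomes 'Saint', any later 'St' becomes 'Street'."""
    tokens = address.replace(",", "").replace(".", "").split()
    out = []
    for i, tok in enumerate(tokens):
        if tok.lower() == "st":
            out.append("Saint" if i == 0 else "Street")
        else:
            out.append(tok)
    return " ".join(out)

print(standardize_st("St Patricks Church, Main St"))
# Saint Patricks Church Main Street
```

A real rule would also consult discrete prefix and suffix dictionaries rather than hard-coding the two meanings.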

Conclusion
Using the data cleansing and standardization techniques described in this Best Practice can help an organization to recognize the value of incorporating IDQ into its development methodology. Because data quality is an iterative process, the business rules initially developed may require ongoing modification, as the results produced by IDQ are affected by the starting condition of the data and the requirements of the business users.

When data arrives in multiple languages, it is worth creating similar IDQ plans for each country and applying the same rules across these plans. The data would typically be staged in a database, and the plans developed using a SQL statement as input, with a where country_code = 'DE' clause, for example. Country dictionaries are identifiable by country code to facilitate such statements. Remember that IDQ installs with a large set of reference dictionaries, and additional dictionaries are available from Informatica.

IDQ provides several components that focus on verifying and correcting the accuracy of name and postal address data. These components leverage address reference data that originates from national postal carriers such as the United States Postal Service. Such datasets enable IDQ to validate an address to premise level. Please note that the reference datasets are licensed and installed as discrete Informatica products; it is therefore important to discuss their inclusion in the project with the business in advance so as to avoid budget and installation issues. Several types of reference data, with differing levels of address granularity, are available from Informatica. Pricing for the licensing of these components may vary and should be discussed with the Informatica Account Manager.

Last updated: 01-Feb-07 18:52


Integrating Data Quality Plans with PowerCenter

Challenge


This Best Practice outlines the steps to integrate an Informatica Data Quality (IDQ) plan into a PowerCenter mapping. This document assumes that the appropriate setup and configuration of IDQ and PowerCenter have been completed as part of the software installation process and these steps are not included in this document.

Description
Preparing IDQ Plans for PowerCenter Integration
IDQ plans are typically developed and tested by executing them from Workbench. Plans running locally from Workbench can use any of the available IDQ Source and Sink components. This is not true for plans that are integrated into PowerCenter, as they can only use Source and Sink components that contain the Enable Real-time Processing check box: specifically, CSV Source, CSV Match Source, CSV Sink, and CSV Match Sink. In addition, the Real-time Source and Sink can be used; however, they require additional setup, as each field name and length must be defined. Database Sources and Sinks are not allowed in PowerCenter integration. When IDQ plans are integrated within a PowerCenter mapping, the Source and Sink need to be enabled by setting the Enable Real-time Processing option on them. Consider the following points when developing a plan for integration in PowerCenter:
- If the IDQ plan was developed using a database Source and/or Sink, you must replace them with CSV Sink/Source or CSV Match Sink/Source.
- If the IDQ plan was developed using a Group Sink/Source (or Dual Group Sink), you must replace them with either CSV Sink/Source or CSV Match Sink/Source, depending on the functionality you are replacing. When replacing a Group Sink, you must also add functionality to the PowerCenter mapping to replicate the grouping. This is done by placing a join and sort prior to the IDQ plan containing the match.
- PowerCenter only sees the input and output ports of the IDQ plan from within the PowerCenter mapping. This is driven by the input file used for the Workbench plan and the fields selected as output in the Sink. If you don't see a field after the plan is integrated in PowerCenter, it means the field is not in the input file or not selected as output.
- PowerCenter integration does not allow input ports to be selected as output if the IDQ transformation is defined as a passive transformation. If the IDQ transformation is configured as active, this is not an issue, as you must select all fields needed as output from the IDQ transformation within the Sink transformation of the IDQ plan. Passive and active IDQ transformations follow the general restrictions and rules for active and passive transformations in PowerCenter.
- The delimiter of the Source and Sink must be comma for integrated IDQ plans. Other delimiters, such as pipe, will cause an error within the PowerCenter Designer. If you encounter this error, go back to Workbench, change the delimiter to comma, save the plan, and then go back to PowerCenter Designer and perform the import of the plan again.
- For reusability of IDQ plans, use generic naming conventions for the input and output ports. For example, rather than naming fields customer address1, customer address2, customer city, name them address1, address2, city, etc. Thus, if the same standardization and cleansing is needed by multiple sources, you can integrate the same IDQ plan, which reduces development time as well as ongoing maintenance.
- Use only necessary fields as input to each mapping plan. If you are working with an input file that has 50 fields and you only need 10 fields for the IDQ plan, create a file that contains only the necessary field names, save it as a comma-delimited file, and then point to that newly created file from the source of the IDQ plan. This changes the input field reference to only those fields that must be visible in the PowerCenter integration.


Once the Source and Sink are converted to real-time enabled, you cannot run the plan within Workbench, only within the PowerCenter mapping. However, you may change the check box at any time to revert to standalone processing. Be careful not to refresh the IDQ plan in the mapping within PowerCenter while real-time processing is not enabled. If you do so, the PowerCenter mapping will display an error message and will not allow that mapping to be integrated until real-time processing is enabled again.

Integrating IDQ Plans into PowerCenter Mappings


After the IDQ plans are real-time enabled, they are ready to integrate into a PowerCenter mapping. Integrating into PowerCenter requires proper installation and configuration of the IDQ/PowerCenter integration, including:
- Making appropriate changes to environment variables (to .profile for UNIX)
- Installing IDQ on the PowerCenter server
- Running the IDQ Integration and Content install on the server
- Registering the IDQ plug-in via the PowerCenter Admin console

Note: The plug-in must be registered in each repository from which an IDQ transformation is to be developed.

- Installing IDQ Workbench on the workstation
- Installing IDQ Integration and Content on the workstation using the PowerCenter Designer

When all of the above steps are executed correctly, the IDQ transformation icon, shown below, is visible in the PowerCenter repository.

To integrate an IDQ plan, open the mapping, and click on the IDQ icon. Then click in the mapping workspace to insert the transformation into the mapping. The following dialog box appears:

Select Active or Passive, as appropriate. Typically, an active transformation is necessary only for a matching plan. If selecting Active, the IDQ plan needs to have all input fields passed through, as typical PowerCenter rules apply to Active and Passive transformation processing. As the following figure illustrates, the IDQ transformation is empty in its initial, un-configured state. Notice all ports are currently blank; they will be populated upon import/integration of the IDQ plan.

Double-click on the title bar for the IDQ transformation to open it for editing.


Then select the far right tab, Configuration.

When first integrating an IDQ plan, the connection and repository displays are blank. Click the Connect button to establish a connection to the appropriate IDQ repository.

In the Host Name box, specify the name of the computer on which the IDQ repository is installed. This is usually the PowerCenter server. If the default Port Number (3306) was changed during installation, specify the correct value. Next, click Test Connection.

Note: In some cases, if the user name has not been granted privileges on the host server, you will not be allowed to connect. The procedure for granting privileges to the IDQ (MySQL) repository is explained at the end of this document. When the connection is established, click the down arrow to the right of the Plan Name box, and the following dialog is displayed:

Browse to the plan you want to import, then click on the Validate button. If there is an error in the plan, a dialog box appears. For example, if the Source and Sink have not been configured correctly, the following dialog box appears.

If the plan is valid for PowerCenter integration, the following dialog is displayed.


After a valid plan has been configured, the PowerCenter ports (equivalent to the IDQ Source and Sink fields) are visible and can be connected just as in any other PowerCenter transformation.

Refreshing IDQ Plans for PowerCenter Integration


After Data Quality plans are integrated in PowerCenter, changes made to the IDQ plan in Workbench are not reflected in the PowerCenter mapping until the plan is manually refreshed in the PowerCenter mapping. When you save an IDQ plan, it is saved in the MySQL repository. When you integrate that plan into PowerCenter, a copy of that plan is then integrated in the PowerCenter metadata; the MySQL repository and the PowerCenter repository do not communicate updates automatically. The following steps detail the process for refreshing integrated IDQ plans when necessary to reflect changes made in Workbench.

- Double-click on the IDQ transformation in the PowerCenter mapping.
- Select the Configurations tab.
- Select Refresh. This reads the current version of the plan and refreshes it within PowerCenter.
- Select Apply. If any PowerCenter-specific errors were created when the plan was modified, an error dialog is displayed.


Update input, output, and pass-through ports as necessary, then save the mapping in PowerCenter, and test the changes.

Saving IDQ Plans to the Appropriate Repository: MySQL Permissions


Plans that are to be integrated into PowerCenter mappings must be saved to an IDQ Repository that is visible to the PowerCenter Designer prior to integration. The usual practice is to save the plan to the IDQ repository located on the PowerCenter server.

In order for a Workbench client to save a plan to that repository, the client machine must be granted permissions to the MySQL instance on the server. If the client machine has not been granted access, the client receives an error message when attempting to access the server repository. The person at your organization who has login rights to the server on which IDQ is installed needs to perform this task for all users who will need to save or retrieve plans from the IDQ Server. This procedure is detailed below.
- Identify the IP address for any client machine that needs to be granted access.
- Log in to the server on which the MySQL repository is located, and log in to MySQL:

  mysql -u root

- For a user to connect to the IDQ server and save and retrieve plans, enter the following command:

  grant all privileges on *.* to admin@<idq_client_ip>

- For a user to integrate an IDQ plan into PowerCenter, grant the following privilege:


grant all privileges on *.* to root@<powercenter_client_ip>

Last updated: 20-May-08 23:18


Managing Internal and External Reference Data

Challenge


To provide guidelines for the development and management of the reference data sources that can be used with data quality plans in Informatica Data Quality (IDQ). The goal is to ensure the smooth transition from development to production for reference data files and the plans with which they are associated.

Description
Reference data files can be used by a plan to verify or enhance the accuracy of the data inputs to the plan. A reference data file is a list of verified-correct terms and, where appropriate, acceptable variants on those terms. It may be a list of employees, package measurements, or valid postal addresses: any data set that provides an objective reference against which project data sources can be checked or corrected. Reference files are essential to some, but not all, data quality processes.

Reference data can be internal or external in origin. Internal data is specific to a particular project or client. Such data is typically generated from internal company information, and it may be custom-built for the project. External data has been sourced or purchased from outside the organization. External data is used when authoritative, independently-verified data is needed to provide the desired level of data quality to a particular aspect of the source data. Examples include the dictionary files that install with IDQ, postal address data sets that have been verified as current and complete by a national postal carrier such as the United States Postal Service, or company registration and identification information from an industry-standard source such as Dun & Bradstreet.

Reference data can be stored in a file format recognizable to Informatica Data Quality or in a format that requires intermediary (third-party) software in order to be read by Informatica applications. Internal data files, as they are often created specifically for data quality projects, are typically saved in the dictionary file format or as delimited text files, which are easily portable into dictionary format. Databases can also be used as a source for internal data. External files are more likely to remain in their original format. For example, external data may be contained in a database or in a library whose files cannot be edited or opened on the desktop to reveal discrete data values.

Working with Internal Data

Obtaining Reference Data


Most organizations already possess much information that can be used as reference data for example, employee tax numbers or customer names. These forms of data may or may not be part of the project source data, and they may be stored in different parts of the organization.

The question arises: are internal data sources sufficiently reliable for use as reference data? Bear in mind that in some cases the reference data does not need to be 100 percent accurate. It can be good enough to compare project data against reference data and to flag inconsistencies between them, particularly in cases where both sets of data are highly unlikely to share common errors.

Saving the Data in .DIC File Format


IDQ installs with a set of reference dictionaries that have been created to handle many types of business data. These dictionaries are created using a proprietary .DIC file name extension. DIC is abbreviated from dictionary, and dictionary files are essentially comma delimited text files. You can create a new dictionary in three ways:
- You can save an appropriately formatted delimited file as a .DIC file into the Dictionaries folders of your IDQ (client or server) installation.
- You can use the Dictionary Manager within Data Quality Workbench. This method allows you to create text and database dictionaries.
- You can write from plan files directly to a dictionary using the IDQ Report Viewer (see below).

The figure below shows a dictionary file open in IDQ Workbench and its underlying .DIC file open in a text editor. Note that the dictionary file has at least two columns of data. The Label column contains the correct or standardized form of each datum from the dictionary's perspective. The Item columns contain versions of each datum that the dictionary recognizes as identical to or coterminous with the Label entry. Therefore, each datum in the dictionary must have at least two entries in the .DIC file (see the text editor illustration below). A dictionary can have multiple Item columns.

To edit a dictionary value, open the DIC file and make your changes. You can make changes either through a text editor or by opening the dictionary in the Dictionary Manager.


To add a value to a dictionary, open the .DIC file in Dictionary Manager, place the cursor in an empty row, and add a Label string and at least one Item string. You can also add values in a text editor by placing the cursor on a new line and typing Label and Item values separated by commas. Once saved, the dictionary is ready for use in IDQ.

Note: IDQ users with database expertise can create and specify dictionaries that are linked to database tables, and that thus can be updated dynamically when the underlying data is updated. Database dictionaries are useful when the reference data has been originated for other purposes and is likely to change independently of data quality. By making use of a dynamic connection, data quality plans can always point to the current version of the reference data.
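Since a .DIC file is essentially a comma-delimited file with a Label followed by one or more Item variants per line, building a standardization lookup from one can be sketched as below. The sample entries are invented for illustration.

```python
import csv
import io

def load_dic(text):
    """Build a variant -> standard-form map from .DIC-style content,
    where each line is Label,Item1[,Item2,...]."""
    lookup = {}
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue
        label, items = row[0], row[1:]
        for item in [label] + items:        # the Label maps to itself too
            lookup[item.strip().upper()] = label
    return lookup

dic_text = "Street,St,Str\nAvenue,Ave,Av\n"  # invented sample entries
lookup = load_dic(dic_text)
print(lookup["ST"])   # Street
print(lookup["AV"])   # Avenue
```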

Sharing Reference Data Across the Organization


As you can publish or export plans from a local Data Quality repository to server repositories, so you can copy dictionaries across the network. The File Manager within IDQ Workbench provides an Explorer-like mechanism for moving files to other machines across the network. Bear in mind that Data Quality looks for .DIC files in pre-set locations within the IDQ installation when running a plan. By default, Data Quality relies on dictionaries being located in the following locations:
- The Dictionaries folders installed with Workbench and Server.
- The user's file space in the Data Quality service domain.

IDQ does not recognize a dictionary file that is not in such a location, even if you can browse to the file when designing the data quality plan. Thus, any plan that uses a dictionary in a non-standard location will fail. This is most relevant when you publish or export a plan to another machine on the network. You must ensure that copies of any dictionary files used in the local plan are available in a suitable location on the service domain, in the user space on the server, or at a location in the server's Dictionaries folders that corresponds to the dictionaries' location in Workbench, when the plan is copied to the server-side repository.

Note: You can change the locations in which IDQ looks for plan dictionaries by editing the config.xml file. However, this is the master configuration file for the product and you should not edit it without consulting Informatica Support. Bear in mind that Data Quality looks only in the locations set in the config.xml file.

Version Controlling Updates and Managing Rollout from Development to Production


Plans can be version-controlled during development in Workbench and when published to a domain repository. You can create and annotate multiple versions of a plan, and review or roll back to earlier versions when necessary. Dictionary files are not version-controlled by IDQ, however. You should define a process to log changes and back up your dictionaries, using version control software if possible or a manual method otherwise. If modifications are to be made to the versions of dictionary files installed by the software, it is recommended that these modifications be made to a copy of the original file, renamed or relocated as desired. This approach avoids the risk that a subsequent installation might overwrite changes.

Database reference data can also be version controlled, although this presents difficulties if the database is very large in size. Bear in mind that third-party reference data, such as postal address data, should not ordinarily be changed, and so the need for a versioning strategy for these files is debatable.

Working with External Data

Formatting Data into Dictionary Format


External data may or may not permit the copying of data into text format; for example, consider external data contained in a database or in library files. Currently, third-party postal address validation data is provided to Informatica users in this manner, and IDQ leverages software from the vendor to read these files. (The third-party software has a very small footprint.) However, some software files are amenable to data extraction to file.

Obtaining Updates for External Reference Data


External data vendors produce regular data updates, and it's vital to refresh your external reference data when updates become available. The key advantage of external data (its reliability) is lost if you do not apply the latest files from the vendor. If you obtained third-party data through Informatica, you will be kept up to date with the latest data as it becomes available for as long as your data subscription warrants. You can check that you possess the latest versions of third-party data by contacting your Informatica Account Manager.

Managing Reference Updates and Rolling Out Across the Organization


If your organization has a reference data subscription, you will receive either regular data files on compact disc or regular information on how to download data from Informatica or vendor web sites. You must develop a strategy for distributing these updates to all parties who run plans with the external data. This may involve installing the data on machines in a service domain.

Bear in mind that postal address data vendors update their offerings every two or three months, and that a significant percentage of postal addresses can change in such time periods. You should plan for the task of obtaining and distributing updates in your organization at frequent intervals. Depending on the number of IDQ installations that must be updated, distributing third-party reference data across your organization can be a sizable task.

Strategies for Managing Internal and External Reference Data


Experience working with reference data leads to a series of best practice tips for creating and managing reference data files.

Using Workbench to Build Dictionaries


With IDQ Workbench, you can select data fields or columns from a dataset and save them in a dictionary-compatible format.


Let's say you have designed a data quality plan that identifies invalid or anomalous records in a customer database. Using IDQ, you can create an exception file of these bad records, and subsequently use this file to create a dictionary-compatible file. For example, suppose you have an exception file containing suspect or invalid customer account records. Using a very simple data quality plan, you can quickly parse the account numbers from this file to create a new text file containing the account serial numbers only. This file effectively constitutes the labels column of your dictionary. By opening this file in Microsoft Excel or a comparable program, copying the contents of Column A into Column B, and then saving the spreadsheet as a CSV file, you create a file with Label and Item1 columns. Rename the file with a .DIC suffix and add it to the Dictionaries folder of your IDQ installation: the dictionary is now visible to the IDQ Dictionary Manager. You now have a dictionary file of bad account numbers that you can use in any plans checking the validity of the organization's account records.
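The spreadsheet steps described above can also be scripted. The sketch below (in Python, with hypothetical file names) writes the two-column Label/Item1 layout directly, so the Excel copy-and-save step is optional:

```python
import csv

def build_dictionary(exception_file, dictionary_file):
    """Turn a one-column list of bad account numbers into a
    two-column (Label, Item1) dictionary-style .DIC file."""
    with open(exception_file, newline="") as src, \
         open(dictionary_file, "w", newline="") as dic:
        writer = csv.writer(dic)
        for row in csv.reader(src):
            if row:  # skip blank lines
                serial = row[0].strip()
                # Label and Item1 both hold the serial number
                writer.writerow([serial, serial])

# Hypothetical usage: bad_accounts.csv holds one serial number per line
# build_dictionary("bad_accounts.csv", "Bad_Account_Numbers.DIC")
```

As above, the resulting file becomes visible to the Dictionary Manager once it is placed in the Dictionaries folder of the IDQ installation.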

Using Report Viewer to Build Dictionaries


The IDQ Report Viewer allows you to create exception files and dictionaries on the fly from report data. The figure below illustrates how you can drill down into report data, right-click on a column, and save the column data as a dictionary file. This file will be populated with Label and Item1 entries corresponding to the column data. In this case, the dictionary created is a list of serial numbers from invalid customer records (specifically, records containing bad zip codes). The plan designer can now create plans to check customer databases against these serial numbers. You can also append data to an existing dictionary file in this manner.

As a general rule, it is a best practice to follow the dictionary organization structure installed by the application, adding to that structure as necessary to accommodate specialized and supplemental dictionaries. Subsequent users are then relieved of the need to examine the config.xml file for possible modifications, thereby lowering the risk of accidental errors during migration. When following the original dictionary organization structure is not practical or contravenes other requirements, take care to document the customizations.

Since external data may be obtained from third parties and may not be in file format, the most efficient way to share its content across the organization is to locate it on the Data Quality Server machine. (Specifically, this is the machine that hosts the Execution Service.)

Moving Dictionary Files After IDQ Plans are Built


This is a similar issue to that of sharing reference data across the organization. If you must move or relocate your reference data files post-plan development, you have three options:
- You can reset the location to which IDQ looks by default for dictionary files.
- You can reconfigure the plan components that employ the dictionaries to point to the new location. Depending on the complexity of the plan concerned, this can be very labor-intensive.
- If deploying plans in a batch or scheduled task, you can append the new location to the plan execution command. You can do this by appending a parameter file to the plan execution instructions on the command line. The parameter file is an XML file that can contain a simple command to use one file path instead of another.

Last updated: 08-Feb-07 17:09


Real-Time Matching Using PowerCenter

Challenge

This Best Practice describes the rationale for matching in real-time along with the concepts and strategies used in planning for and developing a real-time matching solution. It also provides step-by-step instructions on how to build this process using Informatica's PowerCenter and Data Quality.

The cheapest and most effective way to eliminate duplicate records from a system is to prevent them from ever being entered in the first place. Whether the data is coming from a website, an application entry, EDI feeds, messages on a queue, changes captured from a database, or other common data feeds, matching these records against the existing master data allows only the new, unique records to be added.
Benefits of preventing duplicate records include:
- Better ability to service customers, with the most accurate and complete information readily available
- Reduced risk of fraud or over-exposure
- Trusted information at the source
- Less effort in BI, data warehouse, and/or migration projects

Description
Performing effective real-time matching involves multiple puzzle pieces:
1. There is a master data set (or possibly multiple master data sets) that contains clean and unique customers, prospects, suppliers, products, and/or many other types of data.
2. To interact with the master data set, there is an incoming transaction, typically thought to be a new item. This transaction can be anything from a new customer signing up on the web to a list of new products; it is anything that is assumed to be new and intended to be added to the master.
3. There must be a process to determine if a new item really is new or if it already exists within the master data set.

In a perfect world of consistent IDs, spellings, and representations of data across all companies and systems, checking for duplicates would simply be some sort of exact lookup into the master to see if the item already exists. Unfortunately, this is not the case, and even being creative and using %LIKE% syntax does not provide thorough results. For example, comparing Bob to Robert or GRN to Green requires a more sophisticated approach.
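To illustrate why an exact or %LIKE% lookup falls short, the following sketch scores name similarity with a simple string comparison plus a small nickname table. This is an illustrative stand-in, not IDQ's matching algorithm, and the nickname entries are hypothetical samples of the reference data a real plan would use:

```python
from difflib import SequenceMatcher

# Hypothetical nickname table; real deployments use curated reference data
NICKNAMES = {"bob": "robert", "bill": "william", "liz": "elizabeth"}

def normalize(name):
    """Lowercase, trim, and expand known nicknames."""
    name = name.strip().lower()
    return NICKNAMES.get(name, name)

def name_score(a, b):
    """Return a 0.0-1.0 similarity score between two names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

# An exact lookup misses these pairs, but a similarity score does not:
# name_score("Bob", "Robert") scores 1.0 after nickname normalization,
# and name_score("Green", "Greene") scores well above a 0.8 threshold.
```

The same idea extends to addresses and other fields: comparison is probabilistic, so match results are scores to be thresholded rather than yes/no lookups.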

Standardizing Data in Advance of Matching


The first prerequisite for successful matching is to cleanse and standardize the master data set. This process requires well-defined rules for important attributes. Applying these rules to the data should result in complete, consistent, conformant, and valid data: in other words, trusted data. These rules should also be reusable so they can be applied to the incoming transaction data prior to matching. The more compromises made in the quality of master data by failing to cleanse and standardize, the more effort will need to be put into the matching logic, and the less value the organization will derive from it. There will be many more chances of missed matches allowing duplicates to enter the system.

Once the master data is cleansed, the next step is to develop criteria for candidate selection. For efficient matching, there is no need to compare records that are so dissimilar that they cannot meet the business rules for matching. On the other hand, the set of candidates must be sufficiently broad to minimize the chance that similar records will not be compared. For example, when matching consumer data on name and address, it may be sensible to limit the candidate records pulled to those having the same zip code and the same first letter of the last name, because we can reason that if those elements are different between two records, those two records will not match.

There also may be cases where multiple candidate sets are needed. This would be the case if there are multiple sets of match rules that the two records will be compared against. Adding to the previous example, think of matching on name and address for one set of match rules and name and phone for a second. This would also require selecting records from the master that have the same phone number and first letter of the last name.

Once the candidate selection process is resolved, the matching logic can be developed. This can consist of matching one to many elements of the input record to each candidate pulled from the master. Once the data is compared, each pair of records (one input and one candidate) will have a match score or a series of match scores. Scores below a certain threshold can then be discarded, and potential matches can be output or displayed.

The full real-time match process flow includes:
1. The input record comes into the server.
2. The server standardizes the incoming record and retrieves candidate records from the master data source that could match the incoming record.
3. Match pairs are then generated, one for each candidate, consisting of the incoming record and the candidate.
4. The match pairs then go through the matching logic, resulting in a match score.
5. Records with a match score below a given threshold are discarded.
6. The returned result set consists of the candidates that are potential matches to the incoming record.

Developing an Effective Candidate Selection Strategy


Determining which records from the master should be compared with the incoming record is a critical decision in an effective real-time matching system. For most organizations it is not realistic to match an incoming record against all master records. Consider even a modest customer master data set with one million records; the amount of processing, and thus the wait in real-time, would be unacceptable.

Candidate selection for real-time matching is synonymous with grouping or blocking for batch matching. The goal of candidate selection is to select only that subset of the records from the master that are definitively related by a field, part of a field, or a combination of multiple parts/fields. The selection is done using a candidate key or group key. Ideally this key would be constructed and stored in an indexed field within the master table(s), allowing for the quickest retrieval. There are many instances where multiple keys are used to allow for one key to be missing or different, while another pulls in the record as a candidate.

Which specific data elements the candidate key should consist of very much depends on the scenario and the match rules. The one common theme with candidate keys is that the data elements used should have the highest levels of completeness and validity possible. It is also best to use elements that can be verified as valid, such as a postal code or a National ID. The table below lists multiple common matching elements and how group keys could be used around the data.

The ideal size of the candidate record sets, for sub-second response times, should be under 300 records. For acceptable two to three second response times, candidate record counts should be kept under 5000 records.
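The group-key idea can be sketched as a small key-building function. The example below is an assumption (plain Python, not IDQ syntax) following the recipe used later in this Best Practice: the first three characters of the zip, the house number, and the first character of the street name.

```python
def candidate_key(zipcode, house_number, street_name):
    """Build a candidate (group) key from standardized address parts:
    zip3 + house number + street-name initial."""
    zip3 = (zipcode or "")[:3]
    initial = (street_name or "")[:1].upper()
    return f"{zip3}{house_number or ''}{initial}"

# candidate_key("10017", "350", "Fifth") yields "100350F"; records whose
# keys differ are never compared, which keeps candidate sets small.
```

Storing this value in an indexed column of the master table makes the candidate pull a fast equality lookup rather than a scan.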

Step by Step Development


The following instructions further explain the steps for building a solution to real-time matching using the Informatica suite. They involve the following applications:
- Informatica PowerCenter 8.5.1, utilizing Web Services Hub
- Informatica Data Explorer 5.0 SP4
- Informatica Data Quality 8.5 SP1, utilizing North American Country Pack
- SQL Server 2000

Scenario:
- A customer master file is provided with the following structure.
- In this scenario, we are performing a name and address match.
- Because address is part of the match, we will use the recommended address grouping strategy for our candidate key (see Table 1).
- The desire is that different applications from the business will be able to make a web service call to determine whether the data entry represents a new customer or an existing customer.

Solution:
1. The first step is to analyze the customer master file. Assume that this analysis shows the postcode field is complete for all records and that the majority of it is highly accurate. Assume also that neither the first name nor the last name field is completely populated; thus the match rules must account for blank names.
2. The next step is to load the customer master file into the database. Below is a list of tasks that should be implemented in the mapping that loads the customer master data into the database:
- Standardize and validate the address, outputting the discrete address components such as house number, street name, street type, directional, and suite number. (Pre-built mapplet to do this; country pack)
- Generate the candidate key field, populate it with the selected strategy (assume it is the first 3 characters of the zip, the house number, and the first character of the street name), and generate an index on that field. (Expression on the output of the previous mapplet; hint: substr(in_ZIPCODE, 0, 3) || in_HOUSE_NUMBER || substr(in_STREET_NAME, 0, 1))
- Standardize the phone number. (Pre-built mapplet to do this; country pack)
- Parse the name field into individual fields. Although the data structure indicates names are already parsed into first, middle, and last, assume there are examples where the names are not properly fielded. Also remember to output a value to handle nicknames. (Pre-built mapplet to do this; country pack)

Once complete, your customer master table should look something like this:

3. Now that the customer master has been loaded, a Web Service mapping must be created to handle real-time matching. For this project, assume that the incoming record will include a full name field, address, city, state, zip, and a phone number. All fields will be free-form text. Since we are providing the Service, we will be using a Web Service Provider source and target. Follow these steps to build the source and target definitions.
- Within PowerCenter Designer, go to the Source Analyzer and select the Source menu. From there select Web Service Provider and then Create Web Service Definition.

You will see a screen like the one below where the Service can be named and input and output ports can be created. Since this is a matching scenario, the potential that multiple records will be returned must be taken into account. Select the Multiple Occurring Elements checkbox for the output ports section. Also add a match score output field to return the percentage at which the input record matches the different potential matching records from the master.


Both the source and target should now be present in the project folder.

4. An IDQ match plan must be built for use within the mapping. In developing a plan for real-time use, the most significant difference from a similar match plan designed for use in IDQ standalone is using a CSV Source and a CSV Sink, both enabled for real-time. The source will have the _1 and _2 fields that a Group Source would supply built into it, e.g., Firstname_1 and Firstname_2. Another difference from batch matching in PowerCenter is that the DQ transformation can be set to passive. The following steps illustrate converting the North America Country Pack's Individual Name and Address Match plan from a plan built for use in a batch mapping to a plan built for use in a real-time mapping.
- Open the DCM_NorthAmerica project and, from within the Match folder, make a copy of the Individual Name and Address Match plan. Rename it to RT Individual Name and Address Match.
- Create a new stub CSV file with only the header row. This will be used to generate a new CSV Source within the plan. This header must use all of the input fields used by the plan before modification, duplicating all of the fields with one set having a suffix of _1 and the other _2. For convenience, a sample stub header is listed below:

IN_GROUP_KEY_1,IN_FIRSTNAME_1,IN_FIRSTNAME_ALT_1,IN_MIDNAME_1,IN_LASTNAME_1,IN_POSTNAME_1,IN_HOUSE_NUM_1,IN_STREET_NAME_1,IN_DIRECTIONAL_1,IN_ADDRESS2_1,IN_SUITE_NUM_1,IN_CITY_1,IN_STATE_1,IN_POSTAL_CODE_1,IN_GROUP_KEY_2,IN_FIRSTNAME_2,IN_FIRSTNAME_ALT_2,IN_MIDNAME_2,IN_LASTNAME_2,IN_POSTNAME_2,IN_HOUSE_NUM_2,IN_STREET_NAME_2,IN_DIRECTIONAL_2,IN_ADDRESS2_2,IN_CITY_2,IN_STATE_2,IN_POSTAL_CODE_2
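Because the stub header simply duplicates the field list with _1 and _2 suffixes, it can be generated rather than typed by hand. A sketch, using an abbreviated field list for illustration:

```python
def stub_header(fields):
    """Duplicate each input field with _1 and _2 suffixes, the full
    _1 set first, matching the paired-record CSV Source layout."""
    return ",".join([f + "_1" for f in fields] + [f + "_2" for f in fields])

# Abbreviated field list for illustration
FIELDS = ["IN_GROUP_KEY", "IN_FIRSTNAME", "IN_LASTNAME", "IN_POSTAL_CODE"]
header_row = stub_header(FIELDS)
# Write header_row as the single line of the stub CSV file
```

Generating the header this way avoids typos in long field lists and keeps the _1 and _2 sets in identical order.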


- Now delete the CSV Match Source from the plan, add a new CSV Source, and point it at the new stub file. Because the components were originally mapped to the CSV Match Source and that was deleted, the fields within your plan need to be reselected. As you open the different match components and RBAs, you can see the different instances that need to be reselected, as they appear with a red diamond, as seen below.
- Also delete the CSV Match Sink and replace it with a CSV Sink. Only the match score field(s) must be selected for output. This plan will be imported into a passive transformation; consequently, data can be passed around it and does not need to be carried through the transformation. With this implementation you can output multiple match scores, so it is possible to see why two records matched or didn't match on a field-by-field basis.
- Select the checkbox for Enable Real-time Processing in both the source and the sink, and the plan will be ready to be imported into PowerCenter.

5. The mapping will consist of:
a. The source and target previously generated
b. An IDQ transformation importing the plan just built
c. The same IDQ cleansing and standardization transformations used to load the master data (refer to step 2 for specifics)
d. An Expression transformation to generate the group key and build a single directional field
e. A SQL transformation to get the candidate records from the master table
f. A Filter transformation to filter out those records whose match score falls below a certain threshold
g. A Sequence transformation to build a unique key for each matching record returned in the SOAP response

Within PowerCenter Designer, create a new mapping and drag the web service source and target previously created into the mapping. Add the following country pack mapplets to standardize and validate the incoming record from the web service:
- mplt_dq_p_Personal_Name_Standardization_FML
- mplt_dq_p_USA_Address_Validation
- mplt_dq_p_USA_Phone_Standardization_Validation

Add an Expression transformation and build the candidate key from the Address Validation mapplet output fields. Remember to use the same logic as in the mapping that loaded the customer master. Also within the expression, concatenate the pre- and post-directional fields into a single directional field for matching purposes.

Add a SQL transformation to the mapping. The SQL transformation will present a dialog box with a few questions; for this example select Query mode, MS SQL Server (change as desired), and a Static connection. For details on the other options refer to the PowerCenter help. Connect all necessary fields from the source qualifier, DQ mapplets, and Expression transformation to the SQL transformation. These fields should include:
- XPK_n4_Envelope (this is the Web Service message key)
- Parsed name elements
- Standardized and parsed address elements, which will be used for matching
- Standardized phone number

The next step is to build the query from within the SQL transformation to select the candidate records. Make sure that the output fields agree with the query in number, name, and type.

The output of the SQL transformation will be the incoming customer record along with the candidate record. These will be stacked records, where the Input/Output fields represent the input record and the Output-only fields represent the candidate record. A simple example of this is shown in the table below, where a single incoming record is paired with two candidate records:
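The candidate pull performed by the SQL transformation can be sketched with an in-memory SQLite table; the table and column names here are hypothetical stand-ins for the customer master. Note the index on the candidate key, which keeps the pull fast as the master grows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE customer_master (
    cust_id INTEGER, firstname TEXT, lastname TEXT, candidate_key TEXT)""")
# Index the key so candidate pulls stay fast as the master grows
conn.execute("CREATE INDEX idx_key ON customer_master (candidate_key)")
conn.executemany(
    "INSERT INTO customer_master VALUES (?, ?, ?, ?)",
    [(1, "Robert", "Green", "100350F"),
     (2, "Roberta", "Greene", "100350F"),
     (3, "Alice", "Brown", "94110M")])

def pull_candidates(incoming):
    """Pair the incoming record with every master record sharing its key."""
    rows = conn.execute(
        "SELECT cust_id, firstname, lastname FROM customer_master "
        "WHERE candidate_key = ?", (incoming["candidate_key"],)).fetchall()
    return [(incoming, row) for row in rows]

pairs = pull_candidates(
    {"firstname": "Bob", "lastname": "Green", "candidate_key": "100350F"})
# Two match pairs are produced; customer 3 is never compared
```

Each pair then flows into the matching logic, mirroring the stacked input/candidate records described above.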

Comparing the new record to the candidates is done by embedding the IDQ plan converted in step 4 into the mapping through the use of the Data Quality transformation. When this transformation is created, select passive as the transformation type. The output of the Data Quality transformation will be a match score, a float value between 0.0 and 1.0.

Using a Filter transformation, all records that have a match score below a certain threshold are filtered out. For this scenario, the cut-off will be 80%. (Hint: TO_FLOAT(out_match_score) >= .80)

Any record coming out of the Filter transformation is a potential match that exceeds the specified threshold, and the record will be included in the response. Each of these records needs a new unique ID, so the Sequence Generator transformation will be used.

To complete the mapping, the outputs of the Filter and Sequence Generator transformations need to be mapped to the target. Make sure to map the input primary key field (XPK_n4_Envelope_output) to the primary key field of the envelope group in the target (XPK_n4_Envelope) and to the foreign key of the response element group in the target (FK_n4_Envelope). Map the output of the Sequence Generator to the primary key field of the response element group. The mapping should look like this:
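The threshold logic in the Filter transformation can be sketched as follows; the score arrives as text from the plan output, mirroring the TO_FLOAT(out_match_score) >= .80 hint above (the candidate IDs are hypothetical):

```python
def passes_threshold(match_score, threshold=0.80):
    """Mimic the Filter transformation: keep potential matches only.
    The score arrives as a string from the DQ plan output."""
    return float(match_score) >= threshold

# Hypothetical scored candidates: (candidate id, match score as text)
scored = [("cand-1", "0.95"), ("cand-2", "0.62"), ("cand-3", "0.80")]
survivors = [c for c, s in scored if passes_threshold(s)]
# Only the candidates at or above the 80% cut-off survive
```

Note that 0.80 itself passes, matching the >= comparison in the hint.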


6. Before testing the mapping, create a workflow.


- Using the Workflow Manager, generate a new workflow and session for this mapping using all the defaults. Once created, edit the session task. On the Mapping tab, select the SQL transformation and make sure the connection type is relational. Also make sure to select the proper connection. For more advanced tweaking and web service settings, see the PowerCenter documentation.

The final step is to expose this workflow as a Web Service, which is done by editing the workflow: enable Web Services by selecting the enabled checkbox for Web Services. Once the Web Service is enabled, it should be configured. For all the specific details, please refer to the PowerCenter documentation; for the purpose of this scenario:
a. Give the service the name you would like to see exposed to the outside world
b. Set the timeout to 30 seconds
c. Allow 2 concurrent runs
d. Set the workflow to be visible and runnable

7. The web service is ready for testing.


Testing Data Quality Plans

Challenge


To provide a guide for testing data quality processes or plans created using Informatica Data Quality (IDQ) and to manage some of the unique complexities associated with data quality plans.

Description
Testing data quality plans is an iterative process that occurs as part of the Design Phase of Velocity. Plan testing often precedes the project's main testing activities, as the tested plan outputs will be used as inputs in the Build Phase. It is not necessary to formally test the plans used in the Analyze Phase of Velocity.

The development of data quality plans typically follows a prototyping methodology of create, execute, analyze. Testing is performed as part of the third step, in order to determine that the plans are being developed in accordance with design and project requirements. This method of iterative testing helps support rapid identification and resolution of bugs.

Bear in mind that data quality plans are designed to analyze and resolve data content issues. These are not typically cut-and-dried problems; more often they represent a continuum of data improvement issues where it is possible that every data instance is unique and there is a target level of data quality rather than a right or wrong answer. Data quality plans tend to resolve problems in terms of percentages and probabilities that a problem is fixed. For example, the project may set a target of 95 percent accuracy in its customer addresses. The acceptable level of inaccuracy is also likely to change over time, based upon the importance of a given data field to the underlying business process. As well, accuracy should continuously improve as the data quality rules are applied and the existing data sets adhere to a higher standard of quality.
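A percentage target such as 95 percent address accuracy can be tracked with a simple metric: the share of records passing a validation rule. A sketch, with a deliberately simplistic (hypothetical) zip-code rule standing in for a real IDQ validation:

```python
def accuracy_pct(records, is_valid):
    """Percentage of records that pass a validation rule."""
    if not records:
        return 0.0
    return 100.0 * sum(1 for r in records if is_valid(r)) / len(records)

# Hypothetical rule: a 5-digit zip counts as a valid address field
records = [{"zip": "10017"}, {"zip": "1001"}, {"zip": "94110"}, {"zip": ""}]
pct = accuracy_pct(records, lambda r: len(r["zip"]) == 5 and r["zip"].isdigit())
meets_target = pct >= 95.0  # compare the measured rate to the 95% target
```

Recomputing the metric after each create-execute-analyze iteration shows whether the plans are converging on the target.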

Common Questions in Data Quality Plan Testing


- What dataset will you use to test the plans? While the ideal situation is to use a data set that exactly mimics the project production data, you may not gain access to this data. If you obtain a full cloned set of the project data for testing purposes, bear in mind that some plans (specifically some data matching plans) can take several hours to complete. Consider testing data matching plans overnight.
- Are the plans using reference dictionaries? Reference dictionary management is an important factor, since it is possible to make changes to a reference dictionary independently of IDQ and without making any changes to the plan itself. When you pass an IDQ plan as tested, you must ensure that no additional work is carried out on any dictionaries referenced in the plan. Moreover, you must ensure that the dictionary files reside in locations that are valid for IDQ.
- How will the plans be executed? Will they be executed on a remote IDQ Server and/or via a scheduler? In cases like these, it's vital to ensure that your plan resources, including source data files and reference data files, are in valid locations for use by the Data Quality engine. For details on the local and remote locations to which IDQ looks for source and reference data files, refer to the Informatica Data Quality 8.5 User Guide.
- Will the plans be integrated into a PowerCenter transformation? If so, the plans must have real-time enabled data source and sink components.

Strategies for Testing Data Quality Plans


The best practice steps for testing plans can be grouped under two headings.

Testing to Validate Rules


1. Identify a small, representative sample of source data.
2. To determine the results to expect when the plans are run, manually process the data based on the rules for profiling, standardization or matching that the plans will apply.
3. Execute the plans on the test dataset and validate the plan results against the manually-derived results.
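Step 3 amounts to a record-by-record comparison of the plan output with the manually-derived results. A sketch, with hypothetical record IDs and statuses:

```python
def validate_plan(expected, actual):
    """Compare manually derived results with plan output, record by record.
    Both arguments are dicts keyed by record ID; the IDs that disagree
    are returned for investigation."""
    mismatches = []
    for rec_id, exp in expected.items():
        if actual.get(rec_id) != exp:
            mismatches.append(rec_id)
    return mismatches

# Hypothetical manually-derived results vs. plan output
expected = {"r1": "VALID", "r2": "INVALID", "r3": "VALID"}
actual   = {"r1": "VALID", "r2": "VALID",   "r3": "VALID"}
disagreements = validate_plan(expected, actual)
# Any returned ID marks a record where the plan diverges from the
# manual result and the plan rules (or the manual work) need review
```

An empty result indicates the plan applies the rules exactly as intended on the sample.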

Testing to Validate Plan Effectiveness


This process is concerned with establishing that a data enhancement plan has been properly designed; that is, that the plan delivers the required improvements in data quality. This is largely a matter of comparing the business and project requirements for data quality and establishing whether the plans are on course to deliver these. If not, the plans may need a thorough redesign, or the business and project targets may need to be revised. In either case, discussions should be held with the key business stakeholders to review the results of the IDQ plan and determine the appropriate course of action. In addition, once the entire data set is processed against the business rules, there may be other data anomalies that were unaccounted for and that may require additional modifications to the underlying business rules and IDQ plans.

Last updated: 05-Dec-07 16:02


Tuning Data Quality Plans

Challenge


This document gives an insight into the type of considerations and issues a user needs to be aware of when making changes to data quality processes defined in Informatica Data Quality (IDQ). In IDQ, data quality processes are called plans. The principal focus of this Best Practice is knowing how to tune your plans without adversely affecting the plan logic. This Best Practice is not intended to replace training materials, but to serve as a guide for decision making in the areas of adding, removing or changing the operational components that comprise a data quality plan.

Description
You should consider the following questions prior to making changes to a data quality plan:
- What is the purpose of changing the plan? You should consider changing a plan if you believe the plan is not optimally configured, if the plan is not functioning properly and there is a problem at execution time, or if the plan is not delivering expected results as per the plan design principles.
- Are you trained to change the plan? Data quality plans can be complex. You should not alter a plan unless you have been trained or are highly experienced with IDQ methodology.
- Is the plan properly documented? You should ensure all plan documentation on the data flow and the data components is up-to-date. For guidelines on documenting IDQ plans, see the Sample Deliverable Data Quality Plan Design.
- Have you backed up the plan before editing? If you are using IDQ in a client-server environment, you can create a baseline version of the plan using IDQ version control functionality. In addition, you should copy the plan to a new project folder (viz., Work_Folder) in the Workbench for changing and testing, and leave the original plan untouched during testing.
- Is the plan operating directly on production data? This applies especially to standardization plans. When editing a plan, always work on staged data (database or flat-file). You can later migrate the plan to the production environment after complete and thorough testing.


You should have a clear goal whenever you plan to change an existing plan. An event may prompt the change: for example, input data changing (in format or content), or changes in business rules or business/project targets. You should take into account all current change-management procedures, and the updated plans should be thoroughly tested before production processes are updated. This includes integration and regression testing too. (See also Testing Data Quality Plans.)

Bear in mind that at a high level there are two types of data quality plans: data analysis and data enhancement plans.
- Data analysis plans produce reports on data patterns and data quality across the input data. The key objective in data analysis is to determine the levels of completeness, conformity, and consistency in the dataset. In pursuing these objectives, data analysis plans can also identify cases of missing, inaccurate, or noisy data.
- Data enhancement plans correct completeness, conformity, and consistency problems; they can also identify duplicate data entries and fix accuracy issues through the use of reference data.

Your goal in a data analysis plan is to discover the quality and usability of your data. It is not necessarily your goal to obtain the best scores for your data. Your goal in a data enhancement plan is to resolve the data quality issues discovered in the data analysis.

Adding Components
In general, simply adding a component to a plan is not likely to directly affect results if no further changes are made to the plan. However, once the outputs from the new component are integrated into existing components, the data process flow changes, and the plan must be re-tested and its results reviewed in detail before migrating the plan into production. Bear in mind, particularly in data analysis plans, that improved plan statistics do not always mean that the plan is performing better. It is possible to configure a plan that moves beyond the point of truth by focusing on certain data elements and excluding others.

When added to existing plans, some components have a larger impact than others. For example, adding a To Upper component to convert text into upper case may not cause the plan results to change meaningfully, although the presentation of the output data will change. However, adding and integrating a Rule Based Analyzer component (designed to apply business rules) may cause a severe impact, as the rules are likely to change the plan logic.

As well as adding a new component (that is, a new icon) to the plan, you can add a new instance to an existing component. This can have the same effect as adding and integrating a new component icon. To avoid overloading a plan with too many components, it is a good practice to add multiple instances to a single component, within reason. Good plan design suggests that instances within a single component should be logically similar and work on the selected inputs in similar ways. The overall name for the component should also be changed to reflect the logic of the instances it contains. If you add a new instance to a component, and that instance behaves very differently from the other instances in that component (for example, if it acts on an unrelated set of outputs or performs an unrelated type of action on the data), you should probably add a new component for this instance. This will also help you keep track of your changes onscreen.

To avoid making plans over-complicated, it is often a good practice to split tasks into multiple plans where a large number of data quality measures need to be checked. This makes plans and business rules easier to maintain and provides a good framework for future development. For example, in an environment where a large number of attributes must be evaluated against the six standard data quality criteria (i.e., completeness, conformity, consistency, accuracy, duplication, and consolidation), using one plan per data quality criterion may be a good way to move forward. Alternatively, splitting plans up by data entity may be advantageous. Similarly, during standardization, you can create plans for specific function areas (e.g., address, product, or name) as opposed to adding all standardization tasks to a single large plan. For more information on the six standard data quality criteria, see Data Cleansing.

Removing Components
Removing a component from a plan is likely to have a major impact since, in most cases, the data flow in the plan will be broken. If you remove an integrated component, configuration changes are required to all components that use its outputs. The plan cannot run until these configuration changes are completed. The only exceptions are when the output(s) of the removed component are used solely by a CSV Sink component or by a frequency component. Even in these cases, note that the plan output changes, since the column(s) no longer appear in the result set.


Editing Component Configurations


Changing the configuration of a component can have an impact on the overall plan comparable to adding or removing a component: the plan's logic changes, and therefore so do the results it produces. However, although adding or removing a component may make a plan non-executable, changing the configuration of a component can impact the results in more subtle ways. For example, changing the reference dictionary used by a parsing component does not break a plan, but may have a major impact on the resulting output. Similarly, changing the name of a component instance output does not break a plan. By default, component output names cascade through the other components in the plan, so when you change an output name, all subsequent components automatically update with the new output name. It is not necessary to change the configuration of dependent components.

Last updated: 26-May-08 11:12


Using Data Explorer for Data Discovery and Analysis

Challenge


To understand and make full use of Informatica Data Explorer's potential to profile and define mappings for your project data.

Data profiling and mapping provide a firm foundation for virtually any project involving data movement, migration, consolidation, or integration, from data warehouse/data mart development, ERP migrations, and enterprise application integration to CRM initiatives and B2B integration. These types of projects rely on an accurate understanding of the true structure of the source data in order to correctly transform the data for a given target database design. However, the data's actual form rarely coincides with its documented or supposed form. The key to success for data-related projects is to fully understand the data as it actually is, before attempting to cleanse, transform, integrate, mine, or otherwise operate on it. Informatica Data Explorer is a key tool for this purpose. This Best Practice describes how to use Informatica Data Explorer (IDE) in data profiling and mapping scenarios.

Description
Data profiling and data mapping involve a combination of automated and human analyses to reveal the quality, content, and structure of project data sources. Data profiling analyzes several aspects of data structure and content, including the characteristics of each column or field, the relationships between fields, and the commonality of data values between fields, which is often an indicator of redundant data.

Data Profiling
Data profiling involves the explicit analysis of source data and the comparison of observed data characteristics against data quality standards. Data quality and integrity issues include invalid values, multiple formats within a field, non-atomic fields (such as long address strings), duplicate entities, cryptic field names, and others. Quality standards may either be the native rules expressed in the source data's metadata, or an external standard (e.g., corporate, industry, or government) to which the source data must be mapped in order to be assessed.


Data profiling in IDE is based on two main processes:


- Inference of characteristics from the data
- Comparison of those characteristics with specified standards, as an assessment of data quality

Data mapping involves establishing relationships among data elements in various data structures or sources, in terms of how the same information is expressed or stored in different ways in different sources. By performing these processes early in a data project, IT organizations can preempt the code/load/explode syndrome, wherein a project fails at the load stage because the data is not in the anticipated form. Data profiling and mapping are fundamental techniques applicable to virtually any project. The following figure summarizes and abstracts these scenarios into a single depiction of the IDE solution.

The overall process flow for the IDE Solution is as follows:



1. Data and metadata are prepared and imported into IDE.

2. IDE profiles the data, generates accurate metadata (including a normalized schema), and documents cleansing and transformation requirements based on the source and normalized schemas.

3. The resultant metadata are exported to and managed in the IDE Repository.

4. In a derived-target scenario, the project team designs the target database by modeling the existing data sources and then modifying the model as required to meet current business and performance requirements. In this scenario, IDE is used to develop the normalized schema into a target database. The normalized and target schemas are then exported to IDE's FTM/XML tool, which documents transformation requirements between fields in the source, normalized, and target schemas.

OR

5. In a fixed-target scenario, the design of the target database is a given (e.g., because another organization is responsible for developing it, or because an off-the-shelf package or industry standard is to be used). In this scenario, the schema development process is bypassed. Instead, FTM/XML is used to map the source data fields to the corresponding fields in an externally-specified target schema, and to document transformation requirements between fields in the normalized and target schemas. FTM is used for SQL-based metadata structures, and FTM/XML is used to map SQL and/or XML-based metadata structures. Externally-specified targets are typical for ERP package migrations, business-to-business integration projects, or situations where a data modeling team is independently designing the target schema.

6. The IDE Repository is used to export or generate reports documenting the cleansing, transformation, and loading or formatting specs developed with IDE applications.

IDE's Methods of Data Profiling


IDE employs three methods of data profiling:

Column profiling - infers metadata from the data for a column or set of columns. IDE infers both the most likely metadata and alternate metadata consistent with the data.


Table Structural profiling - uses the sample data to infer relationships among the columns in a table. This process can discover primary and foreign keys, functional dependencies, and sub-tables.

Cross-Table profiling - determines the overlap of values across a set of columns, which may come from multiple tables.
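As a rough illustration of what column profiling and cross-table profiling compute, consider the sketch below. This is not IDE's algorithm; it is a simplified stand-in that infers a likely type and basic statistics for a column of sample values, and measures value overlap between two columns as a redundancy indicator. All function names and the type-inference rules are our own assumptions.

```python
import re

def profile_column(values):
    """Column profiling sketch: infer a likely type and basic statistics
    from sample string values (not IDE's actual inference logic)."""
    non_null = [v for v in values if v not in ("", None)]
    if not non_null:
        inferred = "UNKNOWN"
    elif all(re.fullmatch(r"-?\d+", v) for v in non_null):
        inferred = "INTEGER"
    elif all(re.fullmatch(r"-?\d+(\.\d+)?", v) for v in non_null):
        inferred = "DECIMAL"
    else:
        inferred = "VARCHAR(%d)" % max(len(v) for v in non_null)
    return {
        "inferred_type": inferred,
        "null_count": len(values) - len(non_null),
        "distinct": len(set(non_null)),
    }

def value_overlap(col_a, col_b):
    """Cross-table profiling sketch: fraction of distinct values shared by
    two columns - a high overlap can indicate redundant data or a key
    relationship worth investigating."""
    a, b = set(col_a), set(col_b)
    return len(a & b) / len(a | b)
```

For example, `profile_column(["101", "102", "", "104"])` reports an INTEGER column with one null and three distinct values; a real profiler would also surface alternate candidate types consistent with the data.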


Profiling against external standards requires that the data source be mapped to the standard before being assessed (as shown in the following figure). Note that the mapping is performed by IDE's Fixed Target Mapping tool (FTM). IDE can also be used in the development and application of corporate standards, making them relevant to existing systems as well as to new systems.

Data profiling projects may involve iterative profiling and cleansing as well, since data cleansing may improve the quality of the results obtained through dependency and redundancy profiling. Note that Informatica Data Quality should be considered as an alternative tool for data cleansing.

IDE and Fixed-Target Migration


Fixed-target migration projects involve the conversion and migration of data from one or more sources to an externally defined, or fixed, target. IDE is used to profile the data and develop a normalized schema representing the data source(s), while IDE's Fixed Target Mapping tool (FTM) is used to map from the normalized schema to the fixed target. The general sequence of activities for a fixed-target migration project, as shown in the figure below, is as follows:

1. Data is prepared for IDE. Metadata is imported into IDE.

2. IDE profiles the data, generates accurate metadata (including a normalized schema), and documents cleansing and transformation requirements based on the source and normalized schemas. The cleansing requirements can be reviewed and modified by the Data Quality team.

3. The resultant metadata are exported to and managed by the IDE Repository.

4. FTM maps the source data fields to the corresponding fields in an externally specified target schema, and documents transformation requirements between fields in the normalized and target schemas. Externally-specified targets are typical for ERP migrations or projects where a data modeling team is independently designing the target schema.

5. The IDE Repository is used to export or generate reports documenting the cleansing, transformation, and loading or formatting specs developed with IDE and FTM.

6. The cleansing, transformation, and formatting specs can be used by the application development or Data Quality team to cleanse the data, implement any required edits and integrity management functions, and develop the transforms or configure an ETL product to perform the data conversion and migration.

The following screen shot shows how IDE can be used to generate a suggested normalized schema, which may discover hidden tables within tables.


Depending on the staging architecture used, IDE can generate the data definition language (DDL) needed to establish several of the staging databases between the sources and target, as shown below:

Derived-Target Migration
Derived-target migration projects involve the conversion and migration of data from one or more sources to a target database defined by the migration team. IDE is used to profile the data and develop a normalized schema representing the data source(s), and to further develop the normalized schema into a target schema by adding tables and/or fields, eliminating unused tables and/or fields, changing the relational structure, and/or denormalizing the schema to enhance performance. When the target schema is developed from the normalized schema within IDE, the product automatically maintains the mappings from the source to normalized schema, and from the normalized to target schemas. The figure below shows that the general sequence of activities for a derived-target migration project is as follows:

1. Data is prepared for IDE. Metadata is imported into IDE.

2. IDE is used to profile the data, generate accurate metadata (including a normalized schema), and document cleansing and transformation requirements based on the source and normalized schemas. The cleansing requirements can be reviewed and modified by the Data Quality team.

3. IDE is used to modify and develop the normalized schema into a target schema. This generally involves removing obsolete or spurious data elements, incorporating new business requirements and data elements, adapting to corporate data standards, and denormalizing to enhance performance.

4. The resultant metadata are exported to and managed by the IDE Repository.

5. FTM is used to develop and document transformation requirements between the normalized and target schemas. The mappings between the data elements are automatically carried over from the IDE-based schema development process.

6. The IDE Repository is used to export an XSLT document containing the transformation and formatting specs developed with IDE and FTM/XML.

7. The cleansing, transformation, and formatting specs are used by the application development or Data Quality team to cleanse the data, implement any required edits and integrity management functions, and develop the transforms or configure an ETL product to perform the data conversion and migration.

Last updated: 09-Feb-07 12:55


Working with Pre-Built Plans in Data Cleanse and Match

Challenge


To provide a set of best practices for users of the pre-built data quality processes designed for use with the Informatica Data Cleanse and Match (DC&M) product offering. Informatica Data Cleanse and Match is a cross-application data quality solution that installs two components to the PowerCenter system:
- Data Cleanse and Match Workbench, the desktop application in which data quality processes (or plans) can be designed, tested, and executed. Workbench installs with its own Data Quality repository, where plans are stored until needed.
- Data Quality Integration, a plug-in component that integrates Informatica Data Quality and PowerCenter. The plug-in adds a transformation to PowerCenter, called the Data Quality Integration transformation; PowerCenter Designer users can connect to the Data Quality repository and read data quality plan information into this transformation.

Informatica Data Cleanse and Match has been developed to work with Content Packs developed by Informatica. This document focuses on the plans that install with the North America Content Pack, which was developed in conjunction with the components of Data Cleanse and Match. The North America Content Pack delivers data parsing, cleansing, standardization, and de-duplication functionality to United States and Canadian name and address data through a series of pre-built data quality plans and address reference data files. This document focuses on the following areas:
- When to use one plan vs. another for data cleansing
- What behavior to expect from the plans
- How best to manage exception data

Description
The North America Content Pack installs several plans to the Data Quality Repository:
- Plans 01-04 are designed to parse, standardize, and validate United States name and address data.
- Plans 05-07 are designed to enable single-source matching operations (identifying duplicates within a data set) or dual-source matching operations (identifying matching records between two datasets).

The processing logic for data matching is split between PowerCenter and Informatica Data Quality (IDQ) applications.

Plans 01-04: Parsing, Cleansing, and Validation


These plans provide modular solutions for name and address data. The plans can operate on highly unstructured and well-structured data sources. The level of structure contained in a given data set determines the plan to be used. The following diagram demonstrates how the level of structure in address data maps to the plans required to standardize and validate an address.


In cases where the address is well structured and specific data elements (i.e., city, state, and zip) are mapped to specific fields, only the address validation plan may be required. Where the city, state, and zip are mapped to address fields, but not specifically labeled as such (e.g., as Address1 through Address5), a combination of the address standardization and validation plans is required. In extreme cases, where the data is not mapped to any address columns, a combination of the general parser, address standardization, and validation plans may be required to obtain meaning from the data. The purpose of making the plans modular is twofold:
q

It is possible to apply these plans on an individual basis to the data. There is no requirement that the plans be run in sequence with each other. For example, the address validation plan (plan 03) can be run successfully to validate input addresses discretely from the other plans. In fact, the Data Quality Developer will not run all seven plans consecutively on the same dataset. Plans 01 and 02 are not designed to operate in sequence, nor are plans 06 and 07. Modular plans facilitate faster performance. Designing a single plan to perform all the processing tasks contained in the seven plans, even if it were desirable from a functional point of view, would result in significant performance degradation and extremely complex plan logic that would be difficult to modify and maintain.

01 General Parser
The General Parser plan was developed to handle highly unstructured data and to parse it into type-specific fields. For example, consider data stored in the following format:

Field1           | Field2           | Field3           | Field4               | Field5
100 Cardinal Way | Informatica Corp | CA 94063         | info@informatica.com | Redwood City
Redwood City     | 38725            | 100 Cardinal Way | CA 94063             | info@informatica.com

While it is unusual to see data fragmented and spread across a number of fields in this way, it can and does happen. In cases such as this, data is not stored in any specific fields. Street addresses, email addresses, company names, and dates are scattered throughout the data. Using a combination of dictionaries and pattern recognition, the General Parser plan sorts such data into type-specific fields of address, names, company names, Social Security Numbers, dates, telephone numbers, and email addresses, depending on the profile of the content. As a result, the above data will be parsed into the following format:


Address1         | Address2         | Address3     | E-mail               | Date       | Company
100 Cardinal Way | CA 94063         | Redwood City | info@informatica.com |            | Informatica Corp
Redwood City     | 100 Cardinal Way | CA 94063     | info@informatica.com | 08/01/2006 |

The General Parser does not attempt to apply any structure or meaning to the data. Its purpose is to identify and sort data by information type. As demonstrated with the address fields in the above example, the address fields are labeled as addresses; the contents are not arranged in a standard address format, but are flagged as addresses in the order in which they were processed in the file.

The General Parser does not attempt to validate the correctness of a field. For example, the dates are accepted as valid because they have a structure of symbols and numbers that represents a date. A value of 99/99/9999 would also be parsed as a date.

The General Parser does not attempt to handle multiple information types in a single field. For example, if a person name and an address element are contained in the same field, the General Parser would label the entire field either a name or an address, or leave it unparsed, depending on the elements in the field it can identify first (if any). While the General Parser does not make any assumption about the data prior to parsing, it parses based on the elements of data that it can make sense of first. In cases where no elements of information can be labeled, the field is left in a pipe-delimited form containing unparsed data.

The effectiveness of the General Parser in recognizing various information types is a function of the dictionaries used to identify that data and the rules used to sort them. Adding or deleting dictionary entries can greatly affect the effectiveness of this plan. Overall, the General Parser is likely only to be used in limited cases, where certain types of information may be mixed together (e.g., telephone and email in the same contact field), or in cases where the data has been badly managed, such as when several files of differing structures have been merged into a single file.
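The dictionary-and-pattern approach described above can be sketched as follows. This is a crude stand-in, not the plan itself: the regular expressions and the tiny company-suffix set are illustrative assumptions, whereas the real plan relies on IDQ dictionaries and rules. Like the plan, the sketch labels by first match and performs no validation, so 99/99/9999 is still accepted as a date.

```python
import re

# Illustrative patterns only - the actual plan uses IDQ dictionaries and rules.
PATTERNS = [
    ("email", re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+$")),
    ("date", re.compile(r"\d{1,2}/\d{1,2}/\d{2,4}$")),   # accepts 99/99/9999, like the plan
    ("phone", re.compile(r"\(?\d{3}\)?[ -]?\d{3}-\d{4}$")),
    ("address", re.compile(r"\d+\s+\w+")),               # leading street number
]
COMPANY_SUFFIXES = {"corp", "inc", "ltd", "llc"}          # stand-in dictionary

def classify(value):
    """Label a field value by information type, dictionary/pattern first-match."""
    tokens = value.lower().replace(".", "").split()
    if tokens and tokens[-1] in COMPANY_SUFFIXES:
        return "company"
    for label, pattern in PATTERNS:
        if pattern.match(value):
            return label
    return "unparsed"

def general_parse(record):
    """Sort a record's fields into type-specific buckets, preserving the
    order in which values were encountered."""
    buckets = {}
    for value in record:
        buckets.setdefault(classify(value), []).append(value)
    return buckets
```

Note that values this sketch cannot recognize (e.g., a bare city name) fall into the "unparsed" bucket; in the real plan, dictionary coverage determines how much data is labeled versus left unparsed.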

02 Name Standardization
The Name Standardization plan is designed to take in person name or company name information and apply parsing and standardization logic to it. Name Standardization follows two different tracks: one for person names and one for company names.

The plan input fields include two inputs for company names. Data entered in these fields is assumed to be a valid company name, and no additional tests are performed to validate that the data is an existing company name. Any combination of letters, numbers, and symbols can represent a company; therefore, in the absence of an external reference data source, further tests to validate a company name are not likely to yield usable results. Any data entered into the company name fields is subjected to two processes. First, the company name is standardized using the Word Manager component, standardizing any company suffixes included in the field. Second, the standardized company name is matched against the company_names.dic dictionary, which returns the standardized Dun & Bradstreet company name, if found.

The second track is person name standardization. While this track is dedicated to standardizing person names, it does not necessarily assume that all data entered here is a person name. Person names in North America tend to follow a set structure and typically do not contain company suffixes or digits. Therefore, values entered in this field that contain a company suffix or a company name are taken out of the person name track and moved to the company name track. Additional logic is applied to identify people whose last name is similar (or equal) to a valid company name (for example, John Sears); inputs that contain an identified first name and a company name are treated as a person name. If the company name track inputs are already fully populated for the record in question, then any company name detected in a person name column is moved to a field for unparsed company name output.
If the name is not recognized as a company name (e.g., by the presence of a company suffix) but contains digits, the data is parsed into the non-name data output field. Any remaining data is accepted as being a valid person name and parsed as such.

North American person names are typically entered in one of two styles: either a firstname middlename surname format or a surname, firstname middlename format. Name parsing algorithms have been built using this assumption. Name parsing occurs in two passes. The first pass applies a series of dictionaries to the name fields, attempting to parse out name prefixes, name suffixes, first names, and any extraneous data (noise) present. Any remaining details are assumed to be middle name or surname details. A rule is applied to the parsed details to check if the name has been parsed correctly. If not, best-guess parsing is applied to the field based on the possible assumed formats.

When name details have been parsed into first, last, and middle name formats, the first name is used to derive additional details, including gender and the name prefix. Finally, using all parsed and derived name elements, salutations are generated. In cases where no clear gender can be generated from the first name, the gender field is typically left blank or indeterminate. The salutation field is generated according to the derived gender information. This can be easily replicated outside the data quality plan if the salutation is not immediately needed as an output from the process (assuming the gender field is an output).

Depending on the data entered in the person name fields, certain companies may be treated as person names and parsed according to person name processing rules. Likewise, some person names may be identified as companies and standardized according to company name processing logic. This is typically a result of the dictionary content. If this is a significant problem when working with name data, some adjustments to the dictionaries and the rule logic for the plan may be required.

Non-name data encountered in the name standardization plan may be standardized as names depending on the contents of the fields. For example, an address datum such as Corporate Parkway may be standardized as a business name, as Corporate is also a business suffix. Any text data that is entered in a person name field is always treated as a person or company, depending on whether or not the field contains a recognizable company suffix in the text.

To ensure that the name standardization plan is delivering adequate results, Informatica strongly recommends pre- and post-execution analysis of the data. Based on the following input:

ROW ID | IN NAME1
1      | Steven King
2      | Chris Pope Jr.
3      | Shannon C. Prince
4      | Dean Jones
5      | Mike Judge
6      | Thomas Staples
7      | Eugene F. Sears
8      | Roy Jones Jr.
9      | Thomas Smith, Sr
10     | Eddie Martin III
11     | Martin Luther King, Jr.
12     | Staples Corner
13     | Sears Chicago
14     | Robert Tyre
15     | Chris News

The following outputs are produced by the Name Standardization plan:


The last entry (Chris News) is identified as a company in the current plan configuration; such results can be refined by changing the underlying dictionary entries used to identify company and person names.
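The two-track routing and two-style name parsing described above can be sketched in a few lines. This is an illustration of the logic, not the plan itself: the suffix sets and the first-name/gender dictionary are tiny stand-ins for the plan's dictionaries, and the rule handling is deliberately simplified.

```python
COMPANY_SUFFIXES = {"inc", "corp", "co", "ltd", "llc"}
NAME_SUFFIXES = {"jr", "sr", "ii", "iii"}
# Tiny stand-in for the plan's first-name/gender dictionary.
FIRST_NAMES = {"steven": "M", "chris": "M", "eugene": "M", "shannon": "F"}

def standardize_name(raw):
    """Two-track sketch: values ending in a company suffix go to the company
    track; everything else is parsed as a person name in either
    'first middle last' or 'last, first middle' style."""
    words = raw.split()
    surname_first = "," in words[0]                    # "Smith, Thomas" style
    tokens = [w.strip(".,") for w in words]
    if tokens[-1].lower() in COMPANY_SUFFIXES:
        return {"track": "company", "company_name": raw}
    suffix = tokens.pop() if tokens[-1].lower() in NAME_SUFFIXES else ""
    if surname_first:
        last, first = tokens[0], tokens[1]
        middle = " ".join(tokens[2:])
    else:
        first, last = tokens[0], tokens[-1]
        middle = " ".join(tokens[1:-1])
    gender = FIRST_NAMES.get(first.lower(), "")        # blank if indeterminate
    salutation = {"M": "Mr.", "F": "Ms."}.get(gender, "")
    return {"track": "person", "first": first, "middle": middle, "last": last,
            "suffix": suffix, "gender": gender, "salutation": salutation}
```

For example, `standardize_name("Chris Pope Jr.")` parses out the suffix, derives a male gender from the first-name dictionary, and generates the salutation "Mr."; a name whose first name is absent from the dictionary gets a blank gender and no salutation, mirroring the indeterminate case described above.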

03 US Canada Standardization
This plan is designed to apply basic standardization processes to city, state/province, and zip/postal code information for United States and Canadian postal address data. The purpose of the plan is to deliver basic standardization to address elements where processing time is critical and one hundred percent validation is not possible due to time constraints. The plan also organizes key search elements into discrete fields, thereby speeding up the validation process. The plan accepts up to six generic address fields and attempts to parse out city, state/province, and zip/postal code information. All remaining information is assumed to be address information and is absorbed into the address line 1-3 fields. Any information that cannot be parsed into the remaining fields is merged into the non-address data field. The plan makes a number of assumptions that may or may not suit your data:
- When parsing city, state, and zip details, the address standardization dictionaries assume that these data elements are spelled correctly. Variation in town/city names is very limited, and in cases where punctuation differences exist or where town names are commonly misspelled, the standardization plan may not correctly parse the information.
- Zip codes are all assumed to be five-digit. In some files, zip codes that begin with 0 may lack this first number and so appear as four-digit codes, and these may be missed during parsing. Adding four-digit zips to the dictionary is not recommended, as these will conflict with the Plus 4 element of a zip code. Zip codes may also be confused with other five-digit numbers in an address line, such as street numbers.
- City names are also commonly found in street names and other address elements. For example, United is part of a country name (United States of America) and is also a town name in the U.S. Bear in mind that the dictionary parsing operates from right to left across the data, so that country name and zip code fields are analyzed before city names and street addresses. Therefore, the word United may be parsed and written as the town name for a given address before the actual town name datum is reached.
- The plan appends a country code to the end of a parsed address if it can identify it as U.S. or Canadian. Therefore, there is no need to include any country code field in the address inputs when configuring the plan.

Most of these issues can be dealt with, if necessary, by minor adjustments to the plan logic or to the dictionaries, or by adding some pre-processing logic to a workflow prior to passing the data into the plan. The plan assumes that all data entered into it are valid address elements. Therefore, once city, state, and zip details have been parsed out, the plan assumes all remaining elements are street address lines and parses them in the order they occurred as address lines 1-3.
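The right-to-left parsing behavior described above can be sketched as follows. The state and city sets are stand-ins for the plan's dictionaries, and the logic is simplified (no Canadian postal codes, no misspelling tolerance), but it shows why scanning from the right finds the zip and state before street tokens, and how unmatched tokens fall through to the street address lines.

```python
import re

STATES = {"CA", "NY", "TX", "WA", "IL"}          # stand-in state dictionary
CITIES = {"redwood city", "chicago", "seattle"}  # stand-in city dictionary
ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")         # five-digit zip, optional Plus 4

def standardize_address(lines):
    """Parse zip, state, and city out of generic address lines, scanning
    tokens right to left as the plan does; remaining tokens become the
    street address."""
    tokens = " ".join(l for l in lines if l).replace(",", " ").split()
    out = {"zip": "", "state": "", "city": ""}
    rest = []
    for tok in reversed(tokens):                 # right-to-left scan
        if not out["zip"] and ZIP_RE.match(tok):
            out["zip"] = tok
        elif not out["state"] and tok.upper() in STATES:
            out["state"] = tok.upper()
        else:
            rest.insert(0, tok)                  # keep original order
    # Try progressively longer trailing token runs as a city name.
    for n in (3, 2, 1):
        if " ".join(rest[-n:]).lower() in CITIES:
            out["city"] = " ".join(rest[-n:])
            rest = rest[:-n]
            break
    out["address1"] = " ".join(rest)
    return out
```

Note how the caveats above show up even in this sketch: a street number that happens to be five digits would be captured as a zip if it is scanned before the real zip code, which is exactly the confusion the plan documentation warns about.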

04 NA Address Validation
The purposes of the North America Address Validation plan are:
- To match input addresses against known valid addresses in an address database, and
- To parse, standardize, and enrich the input addresses.

Performing these operations is a resource-intensive process. Using the US Canada Standardization plan before the NA Address Validation plan helps to improve validation plan results in cases where city, state, and zip code information are not already in discrete fields. City, state, and zip are key search criteria for the address validation engine, and they need to be mapped into discrete fields. Not having these fields correctly mapped prior to plan execution leads to poor results and slow execution times. The address validation APIs store specific area information in memory and continue to use that information from one record to the next, when applicable. Therefore, when running validation plans, it is advisable to sort address data by zip/postal code in order to maximize the usage of data in memory. In cases where status codes, error codes, or invalid results are generated as plan outputs, refer to the Informatica Data Quality 3.1 User Guide for information on how to interpret them.
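The advice to sort by zip/postal code can be sketched in a few lines. Here `validate_one` is a hypothetical stand-in for the address-validation call, not an Informatica API; the point is only that sorting puts records from the same postal area back to back, so cached area data is reused across consecutive records.

```python
def validate_batch(records, validate_one):
    """Sort by zip/postal code before validation so consecutive records hit
    the same cached postal-area data; the result set is unchanged, only the
    order of validation calls differs."""
    ordered = sorted(records, key=lambda r: r.get("zip", ""))
    return [validate_one(r) for r in ordered]

# Usage sketch: record the order in which validation is invoked.
calls = []
results = validate_batch(
    [{"zip": "94063"}, {"zip": "10001"}, {"zip": "94063"}],
    lambda r: calls.append(r["zip"]) or r,
)
# calls == ["10001", "94063", "94063"]: both 94063 records validate back to back
```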

Plans 05-07: Pre-Match Standardization, Grouping, and Matching


These plans take advantage of PowerCenter and IDQ capabilities and are commonly used in pairs. Users will use either plans 05 and 06 or plans 05 and 07. These plans work as follows:

- 05 Match Standardization and Grouping. This plan is used to perform basic standardization and grouping operations on the data prior to matching.
- 06 Single Source Matching. Single source matching seeks to identify duplicate records within a single data set.
- 07 Dual Source Matching. Dual source matching seeks to identify duplicate records between two datasets.
Note that the matching plans are designed for use within a PowerCenter mapping and do not deliver optimal results when executed directly from IDQ Workbench. Note also that the Standardization and Matching plans are geared towards North American English data. Although they work with datasets in other languages, the results may be sub-optimal.

Matching Concepts
To ensure the best possible matching results and performance, match plans usually use a pre-processing step to standardize and group the data. The aim of standardization here is different from that of a classic standardization plan: the intent is to ensure that different spellings, abbreviations, etc. are as similar to each other as possible in order to return a better match set. For example, 123 Main Rd. and 123 Main Road will obtain an imperfect match score, although they clearly refer to the same street address.

Grouping, in a matching context, means sorting input records based on identical values in one or more user-selected fields. When a matching plan is run on grouped data, serial matching operations are performed on a group-by-group basis, so that data records within a group are matched but records across groups are not. A well-designed grouping plan can dramatically cut plan processing time while minimizing the likelihood of missed matches in the dataset.

Grouping performs two functions. It sorts the records in a dataset to increase matching plan performance, and it creates new data columns to provide group key options for the matching plan. (In PowerCenter, the Sorter transformation can organize the data to facilitate matching performance. Therefore, the main function of grouping in a PowerCenter context is to create candidate group keys. In both Data Quality and PowerCenter, grouping operations do not affect the source dataset itself.)

Matching on un-grouped data involves a large number of comparisons that realistically will not generate a meaningful quantity of additional matches. For example, when looking for duplicates in a customer list, there is little value in comparing the record for John Smith with the record for Angela Murphy, as they are obviously not going to be considered duplicate entries. The type of grouping used depends on the type of information being matched; in general, productive fields for grouping name and address data are location-based (e.g. city name, zip codes) or person/company-based (surname and company name composites). For more information on grouping strategies for the best result/performance relationship, see the Best Practice Effective Data Matching Techniques.

Plan 05 (Match Standardization and Grouping) performs cleansing and standardization operations on the data before group keys are generated. It offers a number of grouping options. The plan generates the following group keys:


- OUT_ZIP_GROUP: first 5 digits of ZIP code
- OUT_ZIP_NAME3_GROUP: first 5 digits of ZIP code and the first 3 characters of the last name
- OUT_ZIP_NAME5_GROUP: first 5 digits of ZIP code and the first 5 characters of the last name
- OUT_ZIP_COMPANY3_GROUP: first 5 digits of ZIP code and the first 3 characters of the cleansed company name
- OUT_ZIP_COMPANY5_GROUP: first 5 digits of ZIP code and the first 5 characters of the cleansed company name
The grouping output used depends on the data contents and data volume.
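The derivation of these composite keys can be sketched roughly in Python. The field names and helper function here are hypothetical; plan 05 builds the actual keys internally from cleansed inputs.

```python
def build_group_keys(record):
    """Derive candidate group keys similar to those produced by plan 05.
    'record' is a dict with cleansed zip, last_name, and company fields
    (field names are illustrative assumptions)."""
    zip5 = record.get("zip", "")[:5]
    name = record.get("last_name", "").upper()
    company = record.get("company", "").upper()
    return {
        "OUT_ZIP_GROUP": zip5,
        "OUT_ZIP_NAME3_GROUP": zip5 + name[:3],
        "OUT_ZIP_NAME5_GROUP": zip5 + name[:5],
        "OUT_ZIP_COMPANY3_GROUP": zip5 + company[:3],
        "OUT_ZIP_COMPANY5_GROUP": zip5 + company[:5],
    }
```

Records sharing a key such as OUT_ZIP_NAME3_GROUP would fall into the same group, so only those records are compared against each other during matching.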

Plans 06 Single Source Matching and 07 Dual Source Matching


Plans 06 and 07 are set up in similar ways and assume that person name, company name, and address data inputs will be used. However, in PowerCenter, plan 07 requires the additional input of a Source tag, typically generated by an Expression transformation upstream in the PowerCenter mapping.

A number of matching algorithms are applied to the address and name elements. To ensure the best possible result, a weight-based component and a custom rule are applied to the outputs from the matching components. For further information on IDQ matching components, consult the Informatica Data Quality 3.1 User Guide.

By default, the plans are configured to write as output all records that match with an 85 percent or higher degree of certainty. The Data Quality Developer can easily adjust this figure in each plan.
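The weight-based scoring idea can be illustrated with a simplified sketch. The similarity measure, weights, and field names below are illustrative stand-ins, not the actual IDQ matching components or their algorithms.

```python
from difflib import SequenceMatcher

def field_score(a, b):
    """Simple string similarity in [0, 1]; a stand-in for IDQ's
    matching algorithms, which are considerably more sophisticated."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(rec_a, rec_b, weights):
    """Combine per-field scores using a weight-based component."""
    total = sum(weights.values())
    score = sum(w * field_score(rec_a[f], rec_b[f]) for f, w in weights.items())
    return score / total

WEIGHTS = {"name": 0.4, "address": 0.4, "company": 0.2}  # illustrative weights
THRESHOLD = 0.85  # same default cut-off as the pre-built plans
```

A record pair whose combined score reaches THRESHOLD would be written as a match; lowering or raising THRESHOLD corresponds to adjusting the plans' default figure.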

PowerCenter Mappings
When configuring the Data Quality Integration transformation for the matching plan, the Developer must select a valid grouping field.

To ensure best matching results, the PowerCenter mapping that contains plan 05 should include a Sorter transformation that sorts data according to the group key to be used during matching. This transformation should follow the standardization and grouping operations.

Note that a single mapping can contain multiple Data Quality Integration transformations, so that the Data Quality Developer or Data Integration Developer can add plan 05 to one Integration transformation and plan 06 or 07 to another in the same mapping. The standardization plan requires a passive transformation, whereas the matching plan requires an active transformation.

The developer can add a Sequencer transformation to the mapping to generate a unique identifier for each input record if these are not present in the source data. (Note that a unique identifier is not required for matching processes.)

When working with the dual source matching plan, additional PowerCenter transformations are required to pre-process the data for the Integration transformation. Expression transformations are used to label each input with a source tag of A and B respectively. The data from the two sources is then joined together using a Union transformation before being passed to the Integration transformation containing the standardization and grouping plan. From here on, the mapping has the same design as the single source version.
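The tag-and-union pre-processing for the dual source case can be approximated as follows. This is an illustrative Python stand-in for the Expression and Union transformations; the record fields and dataset names are hypothetical.

```python
# Hypothetical stand-ins for the two source datasets.
source_a = [{"name": "John Smith", "zip": "02101"}]
source_b = [{"name": "Jon Smith",  "zip": "02101"}]

# Expression-transformation equivalent: label each record with its origin,
# then union the two tagged streams into one (Union-transformation equivalent).
tagged = ([dict(r, source_tag="A") for r in source_a] +
          [dict(r, source_tag="B") for r in source_b])
```

Downstream, the matching step can then restrict comparisons to pairs whose source_tag values differ, which is the essence of dual source matching.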

Last updated: 09-Feb-07 13:18


Designing Data Integration Architectures

Challenge


Develop a sound data integration architecture that can serve as a foundation for data integration solutions.

Description
Historically, organizations have approached the development of a "data warehouse" or "data mart" as a departmental effort, without considering an enterprise perspective. The result has been silos of corporate data and analysis, which very often conflict with each other in terms of both detailed data and the business conclusions implied by it. Data integration efforts are often the cornerstone of today's IT initiatives. Taking an enterprise-wide, architectural stance in developing data integration solutions provides many advantages, including:

- A sound architectural foundation ensures the solution can evolve and scale with the business over time.
- Proper architecture can isolate the application component (business context) of the data integration solution from the technology.
- Broader data integration efforts will be simplified by using a holistic, enterprise-based approach.
- Lastly, architectures allow for reuse - reuse of skills, design objects, and knowledge.

As the evolution of data integration solutions (and the corresponding nomenclature) has progressed, the necessity of building these solutions on a solid architectural framework has become more and more clear. To understand why, a brief review of the history of data integration solutions and their predecessors is warranted.

As businesses become more global, Service Oriented Architecture (SOA) becomes more of an Information Technology standard. Having a solid architecture is paramount to the success of data integration efforts.

Historical Perspective
Online Transaction Processing Systems (OLTPs) have always provided a very detailed, transaction-oriented view of an organization's data. While this view was indispensable for the day-to-day operation of a business, its ability to provide a "big picture" view of the operation, critical for management decision-making, was severely limited. Initial attempts to address this problem took several directions:

- Reporting directly against the production system. This approach minimized the effort associated with developing management reports, but introduced a number of significant issues:
  - The nature of OLTP data is, by definition, "point-in-time." Thus, reports run at different times of the year, month, or even the day, were inconsistent with each other.
  - Ad hoc queries against the production database introduced uncontrolled performance issues, resulting in slow reporting results and degradation of OLTP system performance.
  - Trending and aggregate analysis was difficult (or impossible) with the detailed data available in the OLTP systems.
- Mirroring the production system in a reporting database. While this approach alleviated the performance degradation of the OLTP system, it did nothing to address the other issues noted above.
- Reporting databases. To address the fundamental issues associated with reporting against the OLTP schema, organizations began to move toward dedicated reporting databases. These databases were optimized for the types of queries typically run by analysts, rather than those used by systems supporting data entry clerks or customer service representatives. These databases may or may not have included pre-aggregated data, and took several forms, including traditional RDBMS as well as newer-technology Online Analytical Processing (OLAP) solutions.

The initial attempts at reporting solutions were typically point solutions; they were developed internally to provide very targeted data to a particular department within the enterprise. For example, the Marketing department might extract sales and demographic data in order to infer customer purchasing habits. Concurrently, the Sales department was also extracting sales data for the purpose of awarding commissions to the sales force. Over time, these isolated silos of information became irreconcilable, since the extracts and business rules applied to the data during the extract process differed for the different departments.

The result of this evolution was that the Sales and Marketing departments might report completely different sales figures to executive management, resulting in a lack of confidence in both departments' "data marts." From a technical perspective, the uncoordinated extracts of the same data from the source systems multiple times placed undue strain on system resources.

The solution seemed to be the "centralized" or "galactic" data warehouse. This warehouse would be supported by a single set of periodic extracts of all relevant data into the data warehouse (or Operational Data Store), with the data being cleansed and made consistent as part of the extract process. The problem with this solution was its enormous complexity, typically resulting in project failure. The scale of these failures led many organizations to abandon the concept of the enterprise data warehouse in favor of the isolated, "stovepipe" data marts described earlier. While these solutions still had all of the issues discussed previously, they had the clear advantage of providing individual departments with the data they needed without the unmanageability of the enterprise solution.

As individual departments pursued their own data and data integration needs, they not only created data stovepipes, they also created technical islands. The approaches to populating the data marts and performing the data integration tasks varied widely, resulting in a single enterprise evaluating, purchasing, and being trained on multiple tools and adopting multiple methods for performing these tasks. If, at any point, the organization did attempt to undertake an enterprise effort, it was likely to face the daunting challenge of integrating the disparate data as well as the widely varying technologies. To deal with these issues, organizations began developing approaches that considered the enterprise-level requirements of a data integration solution.

Centralized Data Warehouse


The first approach to gain popularity was the centralized data warehouse. Designed to solve the decision support needs for the entire enterprise at one time, with one effort, the data integration process extracts the data directly from the operational systems. It transforms the data according to the business rules and loads it into a single target database serving as the enterprise-wide data warehouse.

Advantages
The centralized model offers a number of benefits to the overall architecture, including:
- Centralized control. Since a single project drives the entire process, there is centralized control over everything occurring in the data warehouse. This makes it easier to manage a production system while concurrently integrating new components of the warehouse.
- Consistent metadata. Because the warehouse environment is contained in a single database and the metadata is stored in a single repository, the entire enterprise can be queried, whether you are looking at data from Finance, Customers, or Human Resources.
- Enterprise view. Developing the entire project at one time provides a global view of how data from one workgroup coordinates with data from others. Since the warehouse is highly integrated, different workgroups often share common tables such as customer, employee, and item lists.
- High data integrity. A single, integrated data repository for the entire enterprise would naturally avoid all data integrity issues that result from duplicate copies and versions of the same business data.

Disadvantages
Of course, the centralized data warehouse also involves a number of drawbacks, including:
- Lengthy implementation cycle. With the complete warehouse environment developed simultaneously, many components of the warehouse become daunting tasks, such as analyzing all of the source systems and developing the target data model. Even minor tasks, such as defining how to measure profit and establishing naming conventions, snowball into major issues.
- Substantial up-front costs. Many analysts who have studied the costs of this approach agree that this type of effort nearly always runs into the millions. While this level of investment is often justified, the problem lies in the delay between the investment and the delivery of value back to the business.
- Scope too broad. The centralized data warehouse requires a single database to satisfy the needs of the entire organization. Attempts to develop an enterprise-wide warehouse using this approach have rarely succeeded, since the goal is simply too ambitious. As a result, this wide scope has been a strong contributor to project failure.
- Impact on the operational systems. Different tables within the warehouse often read data from the same source tables, but manipulate it differently before loading it into the targets. Since the centralized approach extracts data directly from the operational systems, a source table that feeds into three different target tables is queried three times to load the appropriate target tables in the warehouse. When combined with all the other loads for the warehouse, this can create an unacceptable performance hit on the operational systems.
- Potential integration challenges. A centralized data warehouse has the disadvantage of limited scalability. As businesses change and consolidate, adding new interfaces and/or merging a potentially disparate data source into the centralized data warehouse can be a challenge.

Independent Data Mart


The second warehousing approach is the independent data mart, which gained popularity in 1996 when DBMS magazine ran a cover story featuring this strategy. This architecture is based on the same principles as the centralized approach, but it scales down the scope from solving the warehousing needs of the entire company to the needs of a single department or workgroup. Much like the centralized data warehouse, an independent data mart extracts data directly from the operational sources, manipulates the data according to the business rules, and loads a single target database serving as the independent data mart. In some cases, the operational data may be staged in an Operational Data Store (ODS) and then moved to the mart.

Advantages
The independent data mart is the logical opposite of the centralized data warehouse. The disadvantages of the centralized approach are the strengths of the independent data mart:
- Impact on operational databases localized. Because the independent data mart is trying to solve the DSS needs of a single department or workgroup, only the few operational databases containing the required information need to be analyzed.
- Reduced scope of the data model. The target data modeling effort is vastly reduced since it only needs to serve a single department or workgroup, rather than the entire company.
- Lower up-front costs. The data mart is serving only a single department or workgroup; thus hardware and software costs are reduced.
- Fast implementation. The project can be completed in months, not years. The process of defining business terms and naming conventions is simplified since "players from the same team" are working on the project.

Disadvantages


Of course, independent data marts also have some significant disadvantages:


- Lack of centralized control. Because several independent data marts are needed to solve the decision support needs of an organization, there is no centralized control. Each data mart or project controls itself, but there is no central control from a single location.
- Redundant data. After several data marts are in production throughout the organization, all of the problems associated with data redundancy surface, such as inconsistent definitions of the same data object or timing differences that make reconciliation impossible.
- Metadata integration. Due to their independence, the opportunity to share metadata - for example, the definition and business rules associated with the Invoice data object - is lost. Subsequent projects must repeat the development and deployment of common data objects.
- Manageability. The independent data marts control their own scheduling routines and therefore store and report their metadata differently, with a negative impact on the manageability of the data warehouse. There is no centralized scheduler to coordinate the individual loads appropriately or metadata browser to maintain the global metadata and share development work among related projects.

Dependent Data Marts (Federated Data Warehouses)


The third warehouse architecture is the dependent data mart approach supported by the hub-and-spoke architecture of PowerCenter and PowerExchange. After studying more than one hundred different warehousing projects, Informatica introduced this approach in 1998, leveraging the benefits of the centralized data warehouse and independent data mart. The more general term being adopted to describe this approach is the "federated data warehouse."

Industry analysts have recognized that, in many cases, there is no "one size fits all" solution. Although the goal of true enterprise architecture, with conformed dimensions and strict standards, is laudable, it is often impractical, particularly for early efforts. Thus, the concept of the federated data warehouse was born. It allows for the relatively independent development of data marts, but leverages a centralized PowerCenter repository for sharing transformations, source and target objects, business rules, etc.

Recent literature describes the federated architecture approach as a way to get closer to the goal of a truly centralized architecture while allowing for the practical realities of most organizations. The centralized warehouse concept is sacrificed in favor of a more pragmatic approach, whereby the organization can develop semi-autonomous data marts, so long as they subscribe to a common view of the business. This common business model is the fundamental, underlying basis of the federated architecture, since it ensures consistent use of business terms and meanings throughout the enterprise.

With the exception of the rare case of a truly independent data mart, where no future growth is planned or anticipated, and where no opportunities for integration with other business areas exist, the federated data warehouse architecture provides the best framework for building a data integration solution.


Informatica's PowerCenter and PowerExchange products provide an essential capability for supporting the federated architecture: the shared Global Repository. When used in conjunction with one or more Local Repositories, the Global Repository serves as a sort of "federal" governing body, providing a common understanding of core business concepts that can be shared across the semi-autonomous data marts. These data marts each have their own Local Repository, which typically include a combination of purely local metadata and shared metadata by way of links to the Global Repository.

This environment allows for relatively independent development of individual data marts, but also supports metadata sharing without obstacles. The common business model and names described above can be captured in metadata terms and stored in the Global Repository. The data marts use the common business model as a basis, but extend the model by developing departmental metadata and storing it locally.

A typical characteristic of the federated architecture is the existence of an Operational Data Store (ODS). Although this component is optional, it can be found in many implementations that extract data from multiple source systems and load multiple targets. The ODS was originally designed to extract and hold operational data that would be sent to a centralized data warehouse, working as a time-variant database to support end-user reporting directly from operational systems. A typical ODS had to be organized by data subject area because it did not retain the data model from the operational system.

Informatica's approach to the ODS, by contrast, has virtually no change in data model from the operational system, so it need not be organized by subject area. The ODS does not permit direct end-user reporting, and its refresh policies are more closely aligned with the refresh schedules of the enterprise data marts it may be feeding. It can also perform more sophisticated consolidation functions than a traditional ODS.

Advantages
The Federated architecture brings together the best features of the centralized data warehouse and independent data mart:
- Room for expansion. While the architecture is designed to quickly deploy the initial data mart, it is also easy to share project deliverables across subsequent data marts by migrating local metadata to the Global Repository. Reuse is built in.
- Centralized control. A single platform controls the environment from development to test to production. Mechanisms to control and monitor the data movement from operational databases into the data integration environment are applied across the data marts, easing the system management task.
- Consistent metadata. A Global Repository spans all the data marts, providing a consistent view of metadata.
- Enterprise view. Viewing all the metadata from a central location also provides an enterprise view, easing the maintenance burden for the warehouse administrators. Business users can also access the entire environment when necessary (assuming that security privileges are granted).
- High data integrity. Using a set of integrated metadata repositories for the entire enterprise removes data integrity issues that result from duplicate copies of data.
- Minimized impact on operational systems. Frequently accessed source data, such as customer, product, or invoice records, is moved into the decision support environment once, leaving the operational systems unaffected by the number of target data marts.

Disadvantages
Disadvantages of the federated approach include:
- Data propagation. This approach moves data twice: to the ODS, then into the individual data mart. This requires extra database space to store the staged data, as well as extra time to move the data. However, this disadvantage can be mitigated by not saving the data permanently in the ODS. After the warehouse is refreshed, the ODS can be truncated, or a rolling three months of data can be saved.
- Increased development effort during initial installations. For each table in the target, there needs to be one load developed from the ODS to the target, in addition to all the loads from the source to the targets.

Operational Data Store


Using a staging area or ODS differs from a centralized data warehouse approach since the ODS is not organized by subject area and is not customized for viewing by end users or even for reporting.

The primary focus of the ODS is on providing a clean, consistent set of operational data for creating and refreshing data marts. Separating out this function allows the ODS to provide more reliable and flexible support. Data from the various operational sources is staged in the ODS for subsequent extraction by target systems. In the ODS, data is cleaned and remains normalized, tables from different databases are joined, and a refresh policy is carried out (a change/capture facility may be used to schedule ODS refreshes, for instance). The ODS and the data marts may reside in a single database or be distributed across several physical databases and servers.

Characteristics of the Operational Data Store are:

- Normalized
- Detailed (not summarized)
- Integrated
- Cleansed
- Consistent

Within an enterprise data mart, the ODS can consolidate data from disparate systems in a number of ways:
- Normalizes data where necessary (such as non-relational mainframe data), preparing it for storage in a relational system.
- Cleans data by enforcing commonalities in dates, names, and other data types that appear across multiple systems.
- Maintains reference data to help standardize other formats; references might range from zip codes and currency conversion rates to product-code-to-product-name translations.

The ODS may apply fundamental transformations to some database tables in order to reconcile common definitions, but the ODS is not intended to be a transformation processor for end-user reporting requirements.
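The consolidation functions above can be sketched as a simple Python function. The reference tables, field names, and conversion rates here are purely illustrative assumptions, not actual ODS structures.

```python
from datetime import datetime

# Illustrative reference data an ODS might maintain.
PRODUCT_CODES = {"P-100": "Widget", "P-200": "Gadget"}
FX_RATES = {"EUR": 1.10, "GBP": 1.30}  # hypothetical USD conversion rates

def consolidate(row):
    """Apply common formats so downstream marts see consistent data."""
    out = dict(row)
    # Enforce a common date format (ISO 8601) across source systems.
    out["order_date"] = datetime.strptime(
        row["order_date"], "%m/%d/%Y").date().isoformat()
    # Product-code-to-product-name translation via reference data.
    out["product_name"] = PRODUCT_CODES.get(row["product_code"], "UNKNOWN")
    # Standardize monetary amounts to a single currency.
    out["amount_usd"] = round(row["amount"] * FX_RATES[row["currency"]], 2)
    return out
```

Each data mart fed from such an ODS would then report against the same standardized dates, names, and amounts, rather than each applying its own conversions.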

Its role is to consolidate detailed data within common formats. This enables users to create a wide variety of data integration reports, with confidence that those reports will be based on the same detailed data, using common definitions and formats.

The following table compares the key differences in the three architectures:

                        Centralized       Independent     Federated
                        Data Warehouse    Data Mart       Data Warehouse
Centralized Control     Yes               No              Yes
Consistent Metadata     Yes               No              Yes
Cost Effective          No                Yes             Yes
Enterprise View         Yes               No              Yes
Fast Implementation     No                Yes             Yes
High Data Integrity     Yes               No              Yes
Immediate ROI           No                Yes             Yes
Repeatable Process      No                Yes             Yes
The Role of Enterprise Architecture


The federated architecture approach allows for the planning and implementation of an enterprise architecture framework that addresses not only short-term departmental needs, but also the long-term enterprise requirements of the business. This does not mean that the entire architectural investment must be made in advance of any application development. However, it does mean that development is approached within the guidelines of the framework, allowing for future growth without significant technological change.

The remainder of this chapter will focus on the process of designing and developing a data integration solution architecture using PowerCenter as the platform.

Fitting Into the Corporate Architecture


Very few organizations have the luxury of creating a "green field" architecture to support their decision support needs. Rather, the architecture must fit within an existing set of corporate guidelines regarding preferred hardware, operating systems, databases, and other software. The Technical Architect, if not already an employee of the organization, should ensure that he/she has a thorough understanding of the existing (and future vision of) technical infrastructure. Doing so will eliminate the possibility of developing an elegant technical solution that will never be implemented because it defies corporate standards.


Development FAQs

Challenge


Using the PowerCenter product suite to effectively develop, name, and document components of the data integration solution. While the most effective use of PowerCenter depends on the specific situation, this Best Practice addresses some questions that are commonly raised by project teams. It provides answers in a number of areas, including Logs, Scheduling, Backup Strategies, Server Administration, Custom Transformations, and Metadata. Refer to the product guides supplied with PowerCenter for additional information.

Description
The following pages summarize some of the questions that typically arise during development and suggest potential resolutions.

Mapping Design
Q: How does source format affect performance? (i.e., is it more efficient to source from a flat file rather than a database?)

In general, a flat file that is located on the server machine loads faster than a database located on the server machine. Fixed-width files are faster than delimited files because delimited files require extra parsing. However, if there is an intent to perform intricate transformations before loading to target, it may be advisable to first load the flat file into a relational database, which allows the PowerCenter mappings to access the data in an optimized fashion by using filters, custom transformations, and custom SQL SELECTs where appropriate.

Q: What are some considerations when designing the mapping? (i.e., what is the impact of having multiple targets populated by a single map?)

With PowerCenter, it is possible to design a mapping with multiple targets. If each target has a separate source qualifier, you can then load the targets in a specific order using Target Load Ordering. However, the recommendation is to limit the amount of complex logic in a mapping. Not only is it easier to debug a mapping with a limited number of objects, but such mappings can also be run concurrently and make use of more system resources. When using multiple output files (targets), consider writing to multiple disks or file systems simultaneously. This minimizes disk writing contention and applies both to a session writing to multiple targets and to multiple sessions running simultaneously.

Q: What are some considerations for determining how many objects and transformations to include in a single mapping?

The business requirement is always the first consideration, regardless of the number of objects it takes to fulfill the requirement. Beyond this, consideration should be given to having objects that stage data at certain points to allow both easier debugging and better understandability, as well as to create potential partition points.
This should be balanced against the fact that more objects means more overhead for the DTM process. It should also be noted that the most expensive use of the DTM is passing unnecessary data through the mapping. It is best to use filters as early as possible in the mapping to remove rows of data that are not needed. This is the SQL equivalent of the WHERE clause. Using the filter condition in the Source Qualifier to filter out the rows at the database level is a good way to increase the performance of the mapping. If this is not possible, a filter or router transformation can be used instead.
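For example, a source filter such as the following (the table and column names are hypothetical) restricts rows at the database before they ever enter the mapping; PowerCenter appends the condition to the WHERE clause of the generated SELECT:

```
ORDERS.ORDER_STATUS = 'SHIPPED' AND ORDERS.ORDER_DATE >= '01/01/2007'
```

Note that the date-literal format depends on the source database; the filter text itself contains no WHERE keyword.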

Log File Organization


Q: How does PowerCenter handle logs?

The Service Manager provides accumulated log events from each service in the domain and for sessions and workflows. To perform the logging function, the Service Manager runs a Log Manager and a Log Agent.

The Log Manager runs on the master gateway node. It collects and processes log events for Service Manager domain operations and application services. The log events contain operational and error messages for a domain.

INFORMATICA CONFIDENTIAL BEST PRACTICES 332 of 954

The Service Manager and the application services send log events to the Log Manager. When the Log Manager receives log events, it generates log event files, which can be viewed in the Administration Console.

The Log Agent runs on the nodes to collect and process log events for sessions and workflows. Log events for workflows include information about tasks performed by the Integration Service, workflow processing, and workflow errors. Log events for sessions include information about the tasks performed by the Integration Service, session errors, and load summary and transformation statistics for the session. You can view log events for the last workflow run in the Log Events window in the Workflow Monitor.

Log event files are binary files that the Administration Console Log Viewer uses to display log events. When you view log events in the Administration Console, the Log Manager uses the log event files to display the log events for the domain or application service. For more information, please see Chapter 16: Managing Logs in the Administrator Guide.

Q: Where can I view the logs?

Logs can be viewed in two locations: the Administration Console or the Workflow Monitor. The Administration Console displays domain-level operational and error messages. The Workflow Monitor displays session- and workflow-level processing and error messages.

Q: Where is the best place to maintain Session Logs?

One often-recommended location is a shared directory that is accessible to the gateway node. If you have more than one gateway node, store the logs on a shared disk; this keeps all the logs in the same directory. The location can be changed in the Administration Console. If you have more than one PowerCenter domain, you must configure a different directory path for each domain's Log Manager. Multiple domains cannot use the same shared directory path. For more information, please refer to Chapter 16: Managing Logs of the Administrator Guide.
Q: What documentation is available for the error codes that appear within the error log files?

Log file errors and descriptions appear in Chapter 39: LGS Messages of the PowerCenter Troubleshooting Guide. Error information also appears in the PowerCenter Help file within the PowerCenter client applications. For database-specific errors, consult your database user guide.

Scheduling Techniques
Q: What are the benefits of using workflows with multiple tasks rather than a workflow with a stand-alone session?

Using a workflow to group logical sessions minimizes the number of objects that must be managed to successfully load the warehouse. For example, a hundred individual sessions can be logically grouped into twenty workflows. The Operations group can then work with twenty workflows to load the warehouse, which simplifies the operations tasks associated with loading the targets. Workflows can be created to run tasks sequentially or concurrently, or to have tasks in different paths doing either.
- A sequential workflow runs sessions and tasks one at a time, in a linear sequence. Sequential workflows help ensure that dependencies are met as needed. For example, a sequential workflow ensures that session1 runs before session2 when session2 is dependent on the load of session1, and so on. It is also possible to set up conditions to run the next session only if the previous session was successful, or to stop on errors, etc.
- A concurrent workflow groups logical sessions and tasks together, like a sequential workflow, but runs all the tasks at one time. This can reduce the load times into the warehouse, taking advantage of hardware platforms' symmetric multiprocessing (SMP) architecture.

Other workflow options, such as nesting worklets within workflows, can further reduce the complexity of loading the warehouse. This capability allows for the creation of very complex and flexible workflow streams without the use of a third-party scheduler.

Q: Assuming a workflow failure, does PowerCenter allow restart from the point of failure?

No. When a workflow fails, you can choose to start the workflow from a particular task, but not from the point of failure. It is possible, however, to create tasks and flows based on error-handling assumptions. If a previously running real-time workflow fails, first recover and then restart that workflow from the Workflow Monitor.

Q: How can a failed workflow be recovered if it is not visible from the Workflow Monitor?

Start the Workflow Manager and open the corresponding workflow. Find the failed task and right-click to select "Recover Workflow From Task."

Q: What guidelines exist regarding the execution of multiple concurrent sessions/workflows within or across applications?

Workflow execution needs to be planned around two main constraints:
- Available system resources
- Memory and processors

The number of sessions that can run efficiently at one time depends on the number of processors available on the server. The load manager is always running as a process. If bottlenecks with regard to I/O and network are addressed, a session will be compute-bound, meaning its throughput is limited by the availability of CPU cycles. Most sessions are transformation-intensive, so the DTM always runs. However, some sessions require more I/O, so they use less processor time. A general rule is that a session needs about 120 percent of a processor for the DTM, reader, and writer in total.

For concurrent sessions, one session per processor is about right; you can run more, but that requires a trial-and-error approach to determine what number of sessions starts to affect session performance and possibly adversely affect other executing tasks on the server. If possible, sessions should run at off-peak hours to have as many available resources as possible.

Even after available processors are determined, it is necessary to look at overall system resource usage. Determining memory usage is more difficult than the processor calculation; it tends to vary according to system load and the number of PowerCenter sessions running. The first step is to estimate memory usage, accounting for:
- Operating system kernel and miscellaneous processes
- Database engine
- Informatica Load Manager
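To make the arithmetic concrete, the rules of thumb in this section can be combined into a back-of-the-envelope capacity sketch. Every figure below is an assumption for illustration, not a measured value:

```shell
# Rough capacity estimate: ~120% of one CPU per session, plus whatever memory
# is left after the OS, the database engine, and the Load Manager are accounted for.
CPUS=8
TOTAL_MB=16384
OS_MB=1024          # kernel and miscellaneous processes (assumed)
DB_MB=4096          # database engine (assumed)
LM_MB=256           # Informatica Load Manager (assumed)
SESSION_MB=512      # DTM buffer plus transformation caches, per session (assumed)

MAX_BY_CPU=$(( CPUS * 100 / 120 ))
MAX_BY_MEM=$(( (TOTAL_MB - OS_MB - DB_MB - LM_MB) / SESSION_MB ))

echo "CPU-bound limit:    $MAX_BY_CPU concurrent sessions"
echo "Memory-bound limit: $MAX_BY_MEM concurrent sessions"
```

The lower of the two limits is the practical starting point; the real ceiling still has to be confirmed by the trial-and-error measurement described above.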

Next, each session being run needs to be examined with regard to memory usage, including the DTM buffer size and any cache/memory allocations for transformations such as lookups, aggregators, ranks, sorters, and joiners. At this point, you should have a good idea of what memory is utilized during concurrent sessions. It is important to arrange the production run to maximize use of this memory. Remember to account for sessions with large memory requirements; you may be able to run only one large session, or several small sessions, concurrently.

Load-order dependencies are also an important consideration because they often create additional constraints. For example, load the dimensions first, then the facts. Also, some sources may only be available at specific times, some network links may become saturated if overloaded, and some target tables may need to be available to end users earlier than others.

Q: Is it possible to perform two "levels" of event notification: at the application level, and at the PowerCenter Server level to notify the Server Administrator?

The application level of event notification can be accomplished through post-session email. Post-session email allows you to

create two different messages: one to be sent upon successful completion of the session, the other to be sent if the session fails. Messages can be a simple notification of session completion or failure, or a more complex notification containing specifics about the session. You can use the following variables in the text of your post-session email:

Email Variable   Description
%s               Session name
%l               Total records loaded
%r               Total records rejected
%e               Session status
%t               Table details, including read throughput in bytes/second and write throughput in rows/second
%b               Session start time
%c               Session completion time
%i               Session elapsed time (session completion time - session start time)
%g               Attaches the session log to the message
%m               Name and version of the mapping used in the session
%n               Name of the folder containing the session
%d               Name of the repository containing the session
%a<filename>     Attaches the named file. The file must be local to the Informatica Server. The following are valid filenames: %a<c:\data\sales.txt> or %a</users/john/data/sales.txt>. On Windows NT, you can attach a file of any type. On UNIX, you can only attach text files; if you attach a non-text file, the send may fail. Note: the filename cannot include the greater-than character (>) or a line break.
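For illustration, a hypothetical failure-notification body built from these variables might read as follows (the wording is an example only):

```
Session %s finished with status %e.
Rows loaded: %l   Rows rejected: %r
Started: %b   Completed: %c   Elapsed: %i
%g
```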

The PowerCenter Server on UNIX uses rmail to send post-session email. The repository user who starts the PowerCenter Server must have the rmail tool installed in the path in order to send email. To verify that the rmail tool is accessible:

1. Log in to the UNIX system as the PowerCenter user who starts the PowerCenter Server.
2. Type rmail <fully qualified email address> at the prompt and press Enter.
3. Type '.' to indicate the end of the message and press Enter.
4. You should receive a blank email from the PowerCenter user's email account. If not, locate the directory where rmail resides and add that directory to the path.
5. When you have verified that rmail is installed correctly, you are ready to send post-session email.

The output should look like the following:

Session complete.
Session name: sInstrTest
Total Rows Loaded = 1
Total Rows Rejected = 0

Table Name   Status      Rows Loaded   Rows Rejected   ReadThroughput (bytes/sec)   WriteThroughput (rows/sec)
t_Q3_sales   Completed   1             0               30

No errors encountered.
Start Time: Tue Sep 14 12:26:31 1999
Completion Time: Tue Sep 14 12:26:41 1999
Elapsed time: 0:00:10 (h:m:s)

This information, or a subset, can also be sent to any text pager that accepts email.
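The verification procedure above can be wrapped in a small shell helper. This is a sketch only; whether rmail is present depends on the host:

```shell
# Report whether a mail tool is reachable on the current PATH.
# command -v succeeds only if the named tool can be found.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing - locate it and add its directory to PATH"
    fi
}

check_tool rmail
```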

Backup Strategy Recommendation


Q: Can individual objects within a repository be restored from the backup or from a prior version?

At the present time, individual objects cannot be restored from a backup using the PowerCenter Repository Manager (i.e., you can only restore the entire repository). However, it is possible to restore the backup repository into a different database and then manually copy the individual objects back into the main repository.

It should be noted that PowerCenter does not restore repository backup files created in previous versions of PowerCenter. To correctly restore a repository, the version of PowerCenter used to create the backup file must be used for the restore as well.

An option for the backup of individual objects is to export them to XML files. This allows for the granular re-importation of individual objects: mappings, tasks, workflows, etc. Refer to Migration Procedures - PowerCenter for details on promoting new or changed objects between development, test, QA, and production environments.

Server Administration
Q: What built-in functions does PowerCenter provide to notify someone in the event that the server goes down, or some other significant event occurs?

The Repository Service can be used to send messages notifying users that the server will be shut down. Additionally, the Repository Service can be used to send notification messages about repository objects that are created, modified, or deleted by another user. Notification messages are received through the PowerCenter Client tools.

Q: What system resources should be monitored? What should be considered normal or acceptable server performance levels?

The pmprocs utility, which is available for UNIX systems only, shows the currently executing PowerCenter processes.

pmprocs is a script that combines the ps and ipcs commands. It is available through Informatica Technical Support. The utility provides the following information:
- CPID - Creator PID (process ID)
- LPID - Last PID that accessed the resource
- Semaphores - used to sync the reader and writer
- 0 or 1 - shows slot in LM shared memory

A variety of UNIX and Windows NT commands and utilities are also available; consult your UNIX and/or Windows NT documentation.

Q: What cleanup (if any) should be performed after a UNIX server crash? Or after an Oracle instance crash?

If the UNIX server crashes, you should first check to see if the repository database is able to come back up successfully. If it is, then try to start the PowerCenter Server. Use the pmserver.err log to check whether the server has started correctly. You can also use ps -ef | grep pmserver to see if the server process (the Load Manager) is running.
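The process check can be scripted; this sketch simply wraps the ps/grep pipeline shown above in a reusable function:

```shell
# Return success if any process command line matches the given pattern.
# grep -v grep drops the grep process itself from the ps listing.
is_running() {
    ps -ef | grep "$1" | grep -v grep >/dev/null
}

if is_running pmserver; then
    echo "pmserver is running"
else
    echo "pmserver is not running - check pmserver.err and the repository database"
fi
```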

Custom Transformations
Q: What is the relationship between the Java or SQL transformation and the Custom transformation?

Many advanced transformations, including Java and SQL, were built using the Custom transformation. Custom transformations operate in conjunction with procedures you create outside of the Designer interface to extend PowerCenter functionality. Other transformations that were built using Custom transformations include HTTP, SQL, Union, XML Parser, XML Generator, and many others. Below is a summary of noticeable differences.

Transformation   # of Input Groups   # of Output Groups   Type
Custom           Multiple            Multiple             Active/Passive
HTTP             One                 One                  Passive
Java             One                 One                  Active/Passive
SQL              One                 One                  Active/Passive
Union            Multiple            One                  Active
XML Parser       One                 Multiple             Active
XML Generator    Multiple            One                  Active
For further details, please see the Transformation Guide.

Q: What is the main benefit of a Custom transformation over an External Procedure transformation?

A Custom transformation allows for the separation of input and output functions, whereas an External Procedure transformation handles both the input and output simultaneously. Additionally, an External Procedure transformation's parameters consist of all the ports of the transformation. The ability to separate input and output functions is especially useful for sorting and aggregation, which require all input rows to be processed before outputting any output rows.

Q: How do I change a Custom transformation from Active to Passive, or vice versa?

After the creation of the Custom transformation, the transformation type cannot be changed. In order to set the appropriate type, delete and recreate the transformation.

Q: What is the difference between active and passive Java transformations? When should one be used over the other?

An active Java transformation allows for the generation of more than one output row for each input row. Conversely, a passive Java transformation only allows for the generation of one output row per input row. Use active if you need to generate multiple rows for each input. For example, a Java transformation contains two input ports that represent a start date and an end date; you can generate an output row for each date between the start and end date. Use passive when you need one output row for each input.

Q: What are the advantages of a SQL transformation over a Source Qualifier?

A SQL transformation allows for the processing of SQL queries in the middle of a mapping. It allows you to insert, delete, update, and retrieve rows from a database. For example, you might need to create database tables before adding new transactions; the SQL transformation allows for the creation of these tables from within the workflow.

Q: What is the difference between the SQL transformation's Script and Query modes?

Script mode allows for the execution of externally located ANSI SQL scripts. Query mode executes a query that you define in a query editor. You can pass strings or parameters to the query to define dynamic queries or change the selection parameters. For more information, please see Chapter 22: SQL Transformation in the Transformation Guide.

Metadata
Q: What recommendations or considerations exist as to naming standards or repository administration for metadata that may be extracted from the PowerCenter repository and used in others?

With PowerCenter, you can enter description information for all repository objects (sources, targets, transformations, etc.), but the amount of metadata that you enter should be determined by the business requirements. You can also drill down to the column level and give descriptions of the columns in a table if necessary. All information about column size and scale, data types, and primary keys is stored in the repository.

The decision on how much metadata to create is often driven by project timelines. While it may be beneficial for a developer to enter detailed descriptions of each column, expression, variable, etc., it is also very time-consuming to do so. Therefore, this decision should be made on the basis of how much metadata is likely to be required by the systems that use it. There are some time-saving tools available to better manage a metadata strategy and content, such as third-party metadata software and, for sources and targets, data modeling tools.

Q: What procedures exist for extracting metadata from the repository?

Informatica offers an extremely rich suite of metadata-driven tools for data warehousing applications. All of these tools store, retrieve, and manage their metadata in Informatica's PowerCenter repository. The motivation behind the original Metadata Exchange (MX) architecture was to provide an effective and easy-to-use interface to the repository. Today, Informatica and several key Business Intelligence (BI) vendors, including Brio, Business Objects, Cognos, and MicroStrategy, are effectively using the MX views to report and query the Informatica metadata.
Informatica strongly discourages accessing the repository directly, even for SELECT access, because some releases of PowerCenter change the structure of the repository tables, resulting in a maintenance task for you. Rather, views have been created to provide access to the metadata stored in the repository. Additionally, Informatica's Metadata Manager and Data Analyzer allow for more robust reporting against the repository database and are able to present reports to end users and/or management.

Versioning
Q: How can I keep multiple copies of the same object within PowerCenter?

A: With PowerCenter, you can use version control to maintain previous copies of every changed object. You can enable version control after you create a repository. Version control allows you to maintain multiple versions of an object, control development of the object, and track changes. You can configure a repository for versioning when you create it, or you can upgrade an existing repository to support versioned objects.

When you enable version control for a repository, the repository assigns all versioned objects version number 1, and each object has an active status. You can perform the following tasks when you work with a versioned object:
- View object version properties. Each versioned object has a set of version properties and a status. You can also configure the status of a folder to freeze all objects it contains or make them active for editing.
- Track changes to an object. You can view a history that includes all versions of a given object, and compare any version of the object in the history to any other version. This allows you to determine changes made to an object over time.
- Check the object version in and out. You can check out an object to reserve it while you edit the object. When you check in an object, the repository saves a new version of the object and allows you to add comments to the version. You can also find objects checked out by yourself and other users.
- Delete or purge the object version. You can delete an object from view and continue to store it in the repository. You can recover, or undelete, deleted objects. If you want to permanently remove an object version, you can purge it from the repository.

Q: Is there a way to migrate only the changed objects from Development to Production without having to spend too much time making a list of all changed/affected objects?

A: Yes, there is. You can create deployment groups that allow you to group versioned objects for migration to a different repository. You can create the following types of deployment groups:
- Static. You populate the deployment group by manually selecting objects.
- Dynamic. You use the result set from an object query to populate the deployment group.

To make a smooth migration to Production, you need a query associated with your dynamic deployment group. When you associate an object query with the deployment group, the Repository Agent runs the query at the time of deployment. You can associate an object query with a deployment group when you edit or create the deployment group.

If the repository is enabled for versioning, you may also copy the objects in a deployment group from one repository to another. Copying a deployment group allows you to copy objects from across multiple folders in the source repository into multiple folders in the target repository in a single copy operation. Copying a deployment group also allows you to specify individual objects to copy, rather than the entire contents of a folder.

Performance
Q: Can PowerCenter sessions be load balanced?

A: Yes, if the PowerCenter Enterprise Grid Option is available. The Load Balancer is a component of the Integration Service that dispatches tasks to Integration Service processes running on nodes in a grid. It matches task requirements with resource availability to identify the best Integration Service process to run a task. It can dispatch tasks on a single node or across nodes.

Tasks can be dispatched in three ways: round-robin, metric-based, and adaptive. Additionally, you can set service levels to change the priority of each task waiting to be dispatched; this can be changed in the Administration Console's domain properties. For more information, please refer to Chapter 11: Configuring the Load Balancer in the Administrator Guide.

Web Services
Q: How does the Web Services Hub work in PowerCenter?

A: The Web Services Hub is a web service gateway for external clients. It processes SOAP requests from web service clients that want to access PowerCenter functionality through web services. Web service clients access the Integration Service and Repository Service through the Web Services Hub. The Web Services Hub hosts Batch and Real-time Web Services.

When you install PowerCenter Services, the PowerCenter installer installs the Web Services Hub. Use the Administration Console to configure and manage the Web Services Hub. For more information, please refer to Creating and Configuring the Web Services Hub in the Administrator Guide.

The Web Services Hub connects to the Repository Service and the Integration Service through TCP/IP. Web service clients log in to the Web Services Hub through HTTP(S). The Web Services Hub authenticates the client based on a repository user name and password. You can use the Web Services Hub Console to view service information and download the Web Services Description Language (WSDL) files necessary for running services and workflows.

Last updated: 06-Dec-07 15:00


Event Based Scheduling

Challenge


In an operational environment, the beginning of a task often needs to be triggered by some event, either internal or external to the Informatica environment. In versions of PowerCenter prior to 6.0, this was achieved through the use of indicator files. In PowerCenter 6.0 and later, it is achieved through the use of the Event-Raise and Event-Wait workflow and worklet tasks, as well as indicator files.

Description
Event-based scheduling with versions of PowerCenter prior to 6.0 was achieved through the use of indicator files. Users specified the indicator file configuration in the session configuration under advanced options. When the session started, the PowerCenter Server looked for the specified file name; if it wasn't there, the server waited until it appeared, then deleted it and triggered the session.

In PowerCenter 6.0 and above, event-based scheduling is handled by Event-Wait and Event-Raise tasks. These tasks can be used to define task execution order within a workflow or worklet, and can even be used to control sessions across workflows.
- An Event-Raise task represents a user-defined event.
- An Event-Wait task waits for an event to occur within a workflow. After the event triggers, the PowerCenter Server continues executing the workflow from the Event-Wait task forward.

The following paragraphs describe events that can be triggered by an Event-Wait task.

Waiting for Pre-Defined Events


To use a pre-defined event, you need a session, shell command, script, or batch file to create an indicator file. You must create the file locally or send it to a directory local to the PowerCenter Server. The file can be any format recognized by the PowerCenter Server operating system. You can choose to have the PowerCenter Server delete the indicator file after it detects the file, or you can manually delete the indicator file. The PowerCenter Server marks the status of the Event-Wait task as "failed" if it cannot delete the indicator file.


When you specify the indicator file in the Event-Wait task, specify the directory in which the file will appear and the name of the indicator file. Do not use either a source or target file name as the indicator file name. You must also provide the absolute path for the file, and the directory must be local to the PowerCenter Server. If you only specify the file name, and not the directory, Workflow Manager looks for the indicator file in the system directory. For example, on Windows NT, the system directory is C:/winnt/system32. You can enter the actual name of the file or use server variables to specify the location of the file. The PowerCenter Server writes the time the file appears in the workflow log.

Follow these steps to set up a pre-defined event in the workflow:

1. Create an Event-Wait task and double-click it to open the Edit Tasks dialog box.
2. In the Events tab of the Edit Task dialog box, select Pre-defined.
3. Enter the path of the indicator file.
4. If you want the PowerCenter Server to delete the indicator file after it detects the file, select the Delete Indicator File option in the Properties tab.
5. Click OK.

Pre-defined Event
A pre-defined event is a file-watch event. For pre-defined events, use an Event-Wait task to instruct the PowerCenter Server to wait for the specified indicator file to appear before continuing with the rest of the workflow. When the PowerCenter Server locates the indicator file, it starts the task downstream of the Event-Wait.
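For example, an upstream extract job might drop the indicator file with something as simple as the following sketch. The path and file name are hypothetical, and per the guidance above must not reuse a source or target file name:

```shell
# Create the indicator file that the Event-Wait task is configured to watch for.
INDICATOR_DIR=/tmp/indicators            # must be local to the PowerCenter Server
INDICATOR_FILE="$INDICATOR_DIR/orders_extract.done"

mkdir -p "$INDICATOR_DIR"
touch "$INDICATOR_FILE"
echo "Indicator file created: $INDICATOR_FILE"
```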

User-defined Event
A user-defined event is defined at the workflow or worklet level, and an Event-Raise task triggers the event at one point of the workflow/worklet. If an Event-Wait task is configured in the same workflow/worklet to listen for that event, execution continues from the Event-Wait task forward.

The following is an example of using user-defined events. Assume that you have four sessions that you want to execute in a workflow. You want P1_session and P2_session to execute concurrently to save time. You also want to execute Q3_session after P1_session completes, and Q4_session only when P1_session, P2_session, and Q3_session have all completed. Follow these steps:


1. Link P1_session and P2_session concurrently.
2. Add Q3_session after P1_session.
3. Declare an event called P1Q3_Complete in the Events tab of the workflow properties.
4. In the workspace, add an Event-Raise task after Q3_session.
5. Specify the P1Q3_Complete event in the Event-Raise task properties. This allows the Event-Raise task to trigger the event when P1_session and Q3_session complete.
6. Add an Event-Wait task after P2_session.
7. Specify the P1Q3_Complete event for the Event-Wait task.
8. Add Q4_session after the Event-Wait task.

When the PowerCenter Server processes the Event-Wait task, it waits until the Event-Raise task triggers P1Q3_Complete before it executes Q4_session. The PowerCenter Server executes the workflow in the following order:

1. The PowerCenter Server executes P1_session and P2_session concurrently.
2. When P1_session completes, the PowerCenter Server executes Q3_session.
3. The PowerCenter Server finishes executing P2_session.
4. The Event-Wait task waits for the Event-Raise task to trigger the event.
5. The PowerCenter Server completes Q3_session.
6. The Event-Raise task triggers the event, P1Q3_Complete.
7. The PowerCenter Server executes Q4_session because the event, P1Q3_Complete, has been triggered.

Be sure to take care in setting the links, though. If they are left at the default and Q3_session fails, the Event-Raise will never happen; the Event-Wait will then wait forever, and the workflow will run until it is stopped. To avoid this, check the workflow option "Suspend on Error." With this option, if a session fails, the whole workflow goes into suspended mode and can send an email to notify developers.

Last updated: 01-Feb-07 18:53


Key Management in Data Warehousing Solutions

Challenge


Key management refers to the technique that manages key allocation in a decision support RDBMS to create a single view of reference data from multiple sources. Informatica recommends an approach to key management that ensures everything extracted from a source system is loaded into the data warehouse. This Best Practice provides some tips for employing the Informatica-recommended approach, which deviates from many traditional data warehouse solutions that apply logical and data warehouse (surrogate) key strategies in which transactions are rejected because of referential integrity issues.

Description
Key management in a decision support RDBMS comprises three techniques for handling the following common situations:
- Key merging/matching
- Missing keys
- Unknown keys

All three methods are applicable to a Reference Data Store, whereas only the missing and unknown keys are relevant for an Operational Data Store (ODS). Key management should be handled at the data integration level, thereby making it transparent to the Business Intelligence layer.

Key Merging/Matching
When companies source data from more than one transaction system of a similar type, the same object may have different, non-unique legacy keys. Additionally, a single key may have several descriptions or attributes in each of the source systems. The independence of these systems can result in incongruent coding, which poses a greater problem than records being sourced from multiple systems.


A business can resolve this inconsistency by undertaking a complete code standardization initiative (often as part of a larger metadata management effort) or by applying a Universal Reference Data Store (URDS). Standardizing code requires an object to be uniquely represented in the new system. Alternatively, a URDS contains universal codes for common reference values. Most companies adopt this pragmatic approach while embarking on the longer-term solution of code standardization. The bottom line is that nearly every data warehouse project encounters this issue and needs to find a solution in the short term.

Missing Keys
A problem arises when a transaction is sent through without a value in a column where a foreign key should exist (i.e., a reference to a key in a reference table). This normally occurs during the loading of transactional data, although it can also occur when loading reference data into hierarchy structures. In many older data warehouse solutions, this condition would be identified as an error and the transaction row would be rejected. The row would have to be processed through some other mechanism to find the correct code and loaded at a later date. This is often a slow and cumbersome process that leaves the data warehouse incomplete until the issue is resolved.

The more practical way to resolve this situation is to allocate a special key in place of the missing key, which links it with a dummy 'missing key' row in the related table. This enables the transaction to continue through the loading process and end up in the warehouse without further processing. Furthermore, the row ID of the bad transaction can be recorded in an error log, allowing the addition of the correct key value at a later time. The major advantage of this approach is that any aggregate values derived from the transaction table will be correct, because the transaction exists in the data warehouse rather than being in some external error processing file waiting to be fixed.

Simple Example:

PRODUCT     CUSTOMER    SALES REP   QUANTITY   UNIT PRICE
Audi TT18   Doe10224                1          35,000

In the transaction above, there is no code in the SALES REP column. As this row is processed, a dummy sales rep key (9999999) is added to the record to link it to the 'Missing Rep' record in the SALES REP table. A data warehouse key (8888888) is also added to the
transaction.

PRODUCT     CUSTOMER    SALES REP   QUANTITY   UNIT PRICE   DWKEY
Audi TT18   Doe10224    9999999     1          35,000       8888888

The related sales rep record may look like this:

REP CODE   REP NAME      REP MANAGER
1234567    David Jones   Mark Smith
7654321    Mark Smith
9999999    Missing Rep

An error log entry to identify the missing key on this transaction may look like:

ERROR CODE   TABLE NAME   KEY NAME    KEY
MSGKEY       ORDERS       SALES REP   8888888

This type of error reporting is not usually necessary because the transactions with missing keys can be identified using standard end-user reporting tools against the data warehouse.
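The missing-key substitution described above amounts to a simple defaulting rule. The sketch below is illustrative only; in practice this logic lives inside the PowerCenter mapping (for example, an Expression transformation ahead of the target), and the function name is hypothetical, with the key values taken from the example tables:

```python
MISSING_REP_KEY = "9999999"   # dummy key linked to the 'Missing Rep' reference row

def substitute_missing_key(transaction, dw_key, error_log):
    """Replace a missing SALES REP foreign key with the dummy key so the
    row still loads, and record the problem row in an error log."""
    if not transaction.get("SALES REP"):
        transaction["SALES REP"] = MISSING_REP_KEY
        error_log.append({"ERROR CODE": "MSGKEY", "TABLE NAME": "ORDERS",
                          "KEY NAME": "SALES REP", "KEY": dw_key})
    transaction["DWKEY"] = dw_key   # surrogate key assigned to the transaction
    return transaction

errors = []
row = {"PRODUCT": "Audi TT18", "CUSTOMER": "Doe10224", "SALES REP": None,
       "QUANTITY": 1, "UNIT PRICE": 35000}
loaded = substitute_missing_key(row, "8888888", errors)
print(loaded["SALES REP"], errors[0]["KEY"])  # 9999999 8888888
```

Because the row is repaired rather than rejected, aggregates over the transaction table stay correct, which is the main point of the technique.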

Unknown Keys
Unknown keys need to be treated much like missing keys, except that the load process has to add the unknown key value to the referenced table to maintain integrity, rather than explicitly allocating a dummy key to the transaction. The process also needs to make two error log entries: the first logs the fact that a new and unknown key has been added to the reference table; the second records the transaction in which the unknown key was found.

Simple example: The sales rep reference data record might look like the following:


DWKEY      REP NAME      REP MANAGER
1234567    David Jones   Mark Smith
7654321    Mark Smith
9999999    Missing Rep

A transaction comes into the ODS with the record below:

PRODUCT     CUSTOMER    SALES REP   QUANTITY   UNIT PRICE
Audi TT18   Doe10224    2424242     1          35,000

In the transaction above, the code 2424242 appears in the SALES REP column. As this row is processed, a new row has to be added to the Sales Rep reference table. This allows the transaction to be loaded successfully.

DWKEY      REP NAME   REP MANAGER
2424242    Unknown

A data warehouse key (8888889) is also added to the transaction.

PRODUCT     CUSTOMER    SALES REP   QUANTITY   UNIT PRICE   DWKEY
Audi TT18   Doe10224    2424242     1          35,000       8888889

Some warehouse administrators like to have an error log entry generated to identify the addition of a new reference table entry. This can be achieved simply by adding the following entry to the error log:

ERROR CODE   TABLE NAME   KEY NAME    KEY
NEWROW       SALES REP    SALES REP   2424242

A second log entry can be added with the data warehouse key of the transaction in which the unknown key was found.


ERROR CODE   TABLE NAME   KEY NAME    KEY
UNKNKEY      ORDERS       SALES REP   8888889

As with missing keys, error reporting is not essential because the unknown status is clearly visible through the standard end-user reporting. Moreover, regardless of the error logging, the system is self-healing, because the newly added reference data entry will be updated with full details as soon as these changes appear in a reference data feed. This would result in the reference data entry looking complete.

DWKEY      REP NAME      REP MANAGER
2424242    David Digby   Mark Smith
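The unknown-key path differs in that the load also inserts a placeholder row into the reference table and writes two log entries. Again a hedged, illustrative sketch (in a real mapping this is typically achieved with a dynamic lookup cache); the helper name is hypothetical and the values come from the example:

```python
def handle_unknown_key(transaction, dw_key, reference, error_log):
    """If the SALES REP code is not in the reference table, add a
    placeholder 'Unknown' reference row and write two error-log entries,
    then let the transaction load normally."""
    code = transaction["SALES REP"]
    if code not in reference:
        reference[code] = {"REP NAME": "Unknown", "REP MANAGER": None}
        error_log.append({"ERROR CODE": "NEWROW", "TABLE NAME": "SALES REP",
                          "KEY NAME": "SALES REP", "KEY": code})
        error_log.append({"ERROR CODE": "UNKNKEY", "TABLE NAME": "ORDERS",
                          "KEY NAME": "SALES REP", "KEY": dw_key})
    transaction["DWKEY"] = dw_key
    return transaction

reference = {"1234567": {"REP NAME": "David Jones", "REP MANAGER": "Mark Smith"}}
errors = []
row = {"PRODUCT": "Audi TT18", "CUSTOMER": "Doe10224", "SALES REP": "2424242",
       "QUANTITY": 1, "UNIT PRICE": 35000}
handle_unknown_key(row, "8888889", reference, errors)
print(reference["2424242"]["REP NAME"], len(errors))  # Unknown 2
```

The placeholder row is what makes the design self-healing: a later reference data feed simply updates the 2424242 entry with its real details.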

Employing the Informatica-recommended key management strategy produces the following benefits:

- All rows can be loaded into the data warehouse
- All objects are allocated a unique key
- Referential integrity is maintained
- Load dependencies are removed

Last updated: 01-Feb-07 18:53


Mapping Auto-Generation

Challenge


In the course of developing mappings for PowerCenter, situations can arise where a set of similar functions/procedures must be executed for each mapping. The first reaction to this issue is generally to employ a mapplet. These objects are suited to situations where all of the individual fields/data are the same across uses of the mapplet. However, in cases where the fields are different but the process is the same, a requirement emerges to generate multiple mappings using a standard template of actions and procedures. The potential benefits of autogeneration center on a reduction in the Total Cost of Ownership (TCO) of the integration application and include:
- Reduced build time
- Reduced requirement for skilled developer resources
- Promotion of pattern-based design
- Built-in quality and consistency
- Reduced defect rate through elimination of manual errors
- Reduced support overhead

Description
From the outset, it should be emphasized that autogeneration should be integrated into the overall development strategy. It is probable that some components will still need to be manually developed, and many of the disciplines and best practices documented elsewhere in Velocity still apply. It is best to regard autogeneration as a productivity aid in specific situations and not as a technique that works in all situations. Currently, the autogeneration of 100% of the required components is not a realistic objective.

All of the techniques discussed here revolve around the generation of an XML file which shares the standard format of exported PowerCenter components as defined in the powrmart.dtd schema definition. After being generated, the resulting XML document is imported into PowerCenter using standard facilities available through the user interface or via the command line. With Informatica technology, there are a number of options for XML targeting which can be leveraged to implement autogeneration; you can thus exploit these features to make the technology self-generating.

The stages in implementing an autogeneration strategy are:

1. Establish the Scope for Autogeneration
2. Design the Assembly Line(s)
3. Build the Assembly Line
4. Implement the QA and Testing Strategies

These stages are discussed in more detail in the following sections.

1. Establish the Scope for Autogeneration


There are three types of opportunities for manufacturing components:
- Pattern-Driven
- Rules-Driven
- Metadata-Driven

A Pattern-Driven build is appropriate when a single pattern of transformation is to be replicated for multiple source-target
combinations. For example, the initial extract in a standard data warehouse load typically extracts some source data with standardized filters, and then adds some load metadata before populating a staging table which essentially replicates the source structure.

The potential for a Rules-Driven build typically arises when non-technical users are empowered to articulate transformation requirements in a format which is the source for a process generating components. Usually, this is accomplished via a spreadsheet which defines the source-to-target mapping and uses a standardized syntax to define the transformation rules. To implement this type of autogeneration, it is necessary to build an application (typically based on a PowerCenter mapping) which reads the spreadsheet, matches the sources and targets against the metadata in the repository, and produces the XML output.

Finally, the potential for a Metadata-Driven build arises when the import of source and target metadata enables transformation requirements to be inferred; this also requires a mechanism for mapping sources to targets. For example, when a text source column is mapped to a numeric target column, the inferred rule is to test for data type compatibility.

The first stage in the implementation of an autogeneration strategy is to decide which of these autogeneration types is applicable and to ensure that the appropriate technology is available. In most cases, it is the Pattern-Driven build which is the main area of interest; this is precisely the requirement that the mapping generation license option within PowerCenter is designed to address. This option uses the freely distributed Informatica Data Stencil design tool for Microsoft Visio and freely distributed Informatica Velocity-based mapping templates to accelerate and automate mapping design.

Generally speaking, applications which involve a small number of highly complex flows of data tailored to very specific source/target attributes are not good candidates for pattern-driven autogeneration. Currently, there is a great deal of product innovation in the areas of Rules-Driven and Metadata-Driven autogeneration. One option includes using PowerCenter with an XML target to generate the required XML files that are later imported as mappings. Depending on the scale and complexity of both the autogeneration rules and the functionality of the generated components, it may be advisable to acquire a license for the PowerCenter Unstructured Data option.

In conclusion, at the end of this stage the type of autogeneration should be identified and all the required technology licenses should be acquired.

2. Design the Assembly Line


It is assumed that the standard development activities in the Velocity Architect and Design phases have been undertaken and at this stage, the development team should understand the data and the value to be added to it. It should be possible to identify the patterns of data movement. The main stages in designing the assembly line are:
- Manually develop a prototype
- Distinguish between the generic and the flow-specific components
- Establish the boundaries and interaction between generated and manually-built components
- Agree the format and syntax for the specification of the rules (usually Excel)
- Articulate the rules in the agreed format
- Incorporate component generation in the overall development process
- Develop the manual components (if any)

It is recommended that a prototype is manually developed for a representative subset of the sources and targets, since the adoption of autogeneration techniques does not obviate the need for a re-usability strategy. Even if some components are generated rather than built, it is still necessary to distinguish between the generic and the flow-specific components. This allows the generic functionality to be mapped onto the appropriate re-usable PowerCenter components: mapplets, transformations, user-defined functions, etc.

The manual development of the prototype also allows the scope of the autogeneration to be established. It is unlikely that every
single required PowerCenter component can be generated; the scope may be restricted by the current capabilities of the PowerCenter Visio Stencil. It is necessary to establish the demarcation between generated and manually-built components. It will also be necessary to devise a customization strategy if the autogeneration is seen as a repeatable process: how are manual modifications to the generated components to be implemented? Should they be isolated in discrete components which are called from the generated components?

If the autogeneration strategy is based on an application rather than the Visio stencil mapping generation option, ensure that the components you are planning to generate are consistent with the restrictions on the XML export file by referring to the product documentation.

TIP
If you modify an exported XML file, you need to make sure that the XML file conforms to the structure of powrmart.dtd. You also need to make sure the metadata in the XML file conforms to Designer and Workflow Manager rules. For example, when you define a shortcut to an object, define the folder in which the referenced object resides as a shared folder. Although PowerCenter validates the XML file before importing repository objects from it, it might not catch all invalid changes. If you import into the repository an object that does not conform to Designer or Workflow Manager rules, you may cause data inconsistencies in the repository. Do not modify the powrmart.dtd file.

CRCVALUE Codes
Informatica restricts which elements you can modify in the XML file. When you export a Designer object, the PowerCenter Client might include a Cyclic Redundancy Checking Value (CRCVALUE) code in one or more elements in the XML file. The CRCVALUE code is another attribute in an element. When the PowerCenter Client includes a CRCVALUE code in the exported XML file, you can modify only some attributes and elements before importing the object into a repository. For example, VSAM source objects always contain a CRCVALUE code, so you can only modify some attributes in a VSAM source object. If you modify certain attributes in an element that contains a CRCVALUE code, you cannot import the object.

For more information, refer to the Chapter on Exporting and Importing Objects in the PowerCenter Repository Guide.
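When post-processing exported XML programmatically, it can help to detect up front which elements carry a CRCVALUE code and are therefore restricted. A minimal sketch, assuming only that CRCVALUE appears as an ordinary attribute; the sample document below is a tiny stand-in, not a real powrmart.dtd export:

```python
import xml.etree.ElementTree as ET

def find_crc_protected(xml_text):
    """Return the tag names of elements carrying a CRCVALUE attribute,
    i.e. elements whose modification is restricted before re-import."""
    root = ET.fromstring(xml_text)
    protected = []
    for elem in root.iter():            # walk the whole tree, root included
        if "CRCVALUE" in elem.attrib:
            protected.append(elem.tag)
    return protected

# Tiny stand-in document; a real export follows powrmart.dtd.
sample = """<POWERMART>
  <SOURCE NAME="CUSTOMER_MASTER" CRCVALUE="123456"/>
  <TARGET NAME="CUSTOMER_DIM"/>
</POWERMART>"""
print(find_crc_protected(sample))  # ['SOURCE']
```

Running such a check before editing an export gives an early warning instead of a failed import later.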

3. Build the Assembly Line


Essentially, the requirements for the autogeneration may be discerned from the XML exports of the manually developed prototype.

Autogeneration Based on Visio Data Stencil

(Refer to the product documentation for more information on installation, configuration and usage.)

It is important to confirm that all the required PowerCenter transformations are supported by the installed version of the Stencil. The use of an external industry-standard interface such as MS Visio allows the tool to be used by Business Analysts rather than PowerCenter specialists. Apart from allowing the mapping patterns to be specified, the Stencil may also be used as a documentation tool. Essentially, there are three usage stages:
- Implement the Design in a Visio template
- Publish the Design
- Generate the PC Components


A separate Visio template is defined for every pattern identified in the design phase. A template can be created from scratch or imported from a mapping export; an example is shown below:

[Figure: example Visio mapping template (not reproduced)]

The icons for transformation objects should be familiar to PowerCenter users. Less easily understood is the concept of properties for the links (i.e., relationships) between the objects in the Stencil. These link rules define which ports propagate from one transformation to the next, and there may be multiple rules in a single link. Essentially, the process of developing the template consists of identifying the dynamic components in the pattern and parameterizing them, such as:

- Source and target table name
- Source primary key, target primary key
- Lookup table name and foreign keys
- Transformations

Once the template is saved and validated, it needs to be published, which simply makes it available in formats the generating mechanisms can understand:

- Mapping template parameter XML
- Mapping template XML

One of the outputs from the publishing is the template for the definition of the parameters specified in the template. An example of a modified file is shown below:

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE PARAMETERS SYSTEM "parameters.dtd">
<PARAMETERS REPOSITORY_NAME="REP_MAIN" REPOSITORY_VERSION="179"
            REPOSITORY_CODEPAGE="MS1252" REPOSITORY_DATABASETYPE="Oracle">
  <MAPPING NAME="M_LOAD_CUSTOMER_GENERATED" FOLDER_NAME="PTM_2008_VISIO_SOURCE"
           DESCRIPTION="M_LOAD_CUSTOMER">
    <PARAM NAME="$SRC_KEY$" VALUE="CUSTOMER_CODE" />
    <PARAM NAME="$TGT$" VALUE="CUSTOMER_DIM" />
    <PARAM NAME="$TGT_KEY$" VALUE="CUSTOMER_ID" />
    <PARAM NAME="$SRC$" VALUE="CUSTOMER_MASTER" />
  </MAPPING>
  <MAPPING NAME="M_LOAD_PRODUCT_GENERATED" FOLDER_NAME="PTM_2008_VISIO_SOURCE"
           DESCRIPTION="M_LOAD_CUSTOMER">
    <PARAM NAME="$SRC_KEY$" VALUE="PRODUCT_CODE" />
    <PARAM NAME="$TGT$" VALUE="PRODUCT_DIM" />
    <PARAM NAME="$TGT_KEY$" VALUE="PRODUCT_ID" />
    <PARAM NAME="$SRC$" VALUE="PRODUCT_MASTER" />
  </MAPPING>
</PARAMETERS>

This file is only used in scripted generation. The other output from the publishing is the template in XML format; this file is only used in manual generation.

There is a choice of either manual or scripted mechanisms for generating components from the published files. The manual mechanism involves the importation of the published XML template through the Mapping Template Import Wizard in the PowerCenter Designer; the parameters defined in the template are entered manually through the user interface.

Alternatively, the scripted process is based on a supplied command-line utility, mapgen. The first stage is to manually modify the published parameter file to specify values for all the mappings to be generated. The second stage is to use PowerCenter to export source and target definitions for all the objects referenced in the parameter file; these are required in order to generate the ports. Mapgen requires the following syntax:

- <-t> Visio Drawing File      (i.e., mapping source)
- <-p> ParameterFile           (i.e., parameters)
- <-o> MappingFile             (i.e., output)
- [-d] TableDefinitionDir      (i.e., metadata sources & targets)

The generated output file is imported using the standard import facilities in PowerCenter.

TIP
Even if the scripted option is selected as the main generating mechanism, use the Mapping Template Import Wizard in the PowerCenter Designer to generate the first mapping; this allows the early identification of any errors or inconsistencies in the template.

Autogeneration Based on Informatica Application

This strategy generates PowerCenter XML but can be implemented through either PowerCenter itself or the Unstructured Data option. Essentially, it requires the same build sub-stages as any other data integration application. The following components are anticipated:
- Specification of the formats for source-to-target mapping and transformation rules definition
- Development of a mapping to load the specification spreadsheets into a table
- Development of a mapping to validate the specification and report errors
- Development of a mapping to generate the XML output, excluding critical errors
- Development of a component to automate the importation of the XML output into PowerCenter

One of the main issues to be addressed is whether there is a single generation engine which deals with all of the required patterns, or a series of pattern-specific generation engines. One of the drivers for the design should be the early identification of errors in the specifications; otherwise, the first indication of any problem will be the failure of the XML output to import into PowerCenter. It is very important to define the process around the generation and to allocate responsibilities appropriately.

Autogeneration Based on Java Application

Assuming the appropriate skills are available in the development team, an alternative technique is to develop a Java application to generate the mapping XML files. The PowerCenter Mapping SDK is a Java API that provides all of the elements required to generate mappings. The Mapping SDK can be found in the client installation directory. It contains:
- The javadoc (api directory), describing all the classes of the Java API
- The API (lib directory), which contains the jar files used for Mapping SDK applications
- Some basic samples, which show how Java development with the Mapping SDK is done

The Java application also requires a mechanism to define the final mapping between source and target structures; the application interprets this data source and combines it with the metadata in the repository in order to output the required mapping XML.

4. Implement the QA and Testing Strategies


Presumably, there should be less of a requirement for QA and testing with generated components. This does not mean that the need to test no longer exists; to some extent, the testing effort should be redirected to the components in the assembly line itself. There is a great deal of material in Velocity to support QA and test activities. In particular, refer to Naming Conventions. Informatica suggests adopting a naming convention that distinguishes between generated and manually-built components. For more information on the QA strategy, refer to Using PowerCenter Metadata Manager and Metadata Exchange Views for Quality Assurance. Otherwise, the main areas of focus for testing are:
Last updated: 26-May-08 18:26


Mapping Design

Challenge


Optimizing PowerCenter to create an efficient execution environment.

Description
Although PowerCenter environments vary widely, most sessions and/or mappings can benefit from the implementation of common objects and optimization procedures. Follow these procedures and rules of thumb when creating mappings to help ensure optimization.

General Suggestions for Optimizing


1. Reduce the number of transformations. There is always overhead involved in moving data between transformations.
2. Consider more shared memory for a large number of transformations. Session shared memory between 12MB and 40MB should suffice.
3. Calculate once, use many times.
- Avoid calculating or testing the same value over and over. Calculate it once in an expression, and set a True/False flag.
- Within an expression, use variable ports to calculate a value that can be used multiple times within that transformation.

4. Only connect what is used.
- Delete unnecessary links between transformations to minimize the amount of data moved, particularly in the Source Qualifier. This is also helpful for maintenance. If a transformation needs to be reconnected, it is best to have only the necessary ports set as input and output to reconnect.
- In lookup transformations, change unused ports to be neither input nor output. This makes the transformations cleaner looking. It also makes the generated SQL override as small as possible, which cuts down on the amount of cache necessary and thereby improves performance.

5. Watch the data types.


- The engine automatically converts compatible types. Sometimes data conversion is excessive; data types are automatically converted when types differ between connected ports.
- Minimize data type changes between transformations by planning data flow prior to developing the mapping.

6. Facilitate reuse.
- Plan for reusable transformations upfront.
- Use variables. Use both mapping variables and ports that are variables. Variable ports are especially beneficial when they can be used to calculate a complex expression or perform a disconnected lookup call only once instead of multiple times.
- Use mapplets to encapsulate multiple reusable transformations.
- Use mapplets to leverage the work of critical developers and minimize mistakes when performing similar functions.

7. Only manipulate data that needs to be moved and transformed.
- Reduce the number of non-essential records that are passed through the entire mapping.
- Use active transformations that reduce the number of records as early in the mapping as possible (i.e., place filters and aggregators as close to the source as possible).
- Select the appropriate driving/master table when using joins. The table with the lesser number of rows should be the driving/master table for a faster join.
- Remove or reduce field-level stored procedures.

8. Utilize single-pass reads.
- Redesign mappings to utilize one Source Qualifier to populate multiple targets. This way, the server reads the source only once. If you have different Source Qualifiers for the same source (e.g., one for delete and one for update/insert), the server reads the source for each Source Qualifier.

9. Utilize Pushdown Optimization.
- Design mappings so they can take advantage of the Pushdown Optimization feature. This improves performance by allowing the source and/or target database to perform the mapping logic.

Lookup Transformation Optimizing Tips


1. When your source is large, cache lookup table columns for those lookup tables of 500,000 rows or less. This typically improves performance by 10 to 20 percent.
2. The rule of thumb is not to cache any table over 500,000 rows. This is only true if the standard row byte count is 1,024 or less. If the row byte count is more than 1,024, then you need to adjust the 500K-row standard down as the number of bytes increases (i.e., a 2,048-byte row can drop the cache row count to between 250K and 300K, so the lookup table should not be cached in this case). This is just a general rule, though; try running the session with a large lookup cached and not cached. Caching is often faster on very large lookup tables.
3. When using a Lookup Table Transformation, improve lookup performance by placing all conditions that use the equality operator = first in the list of conditions under the condition tab.
4. Cache lookup tables only if the number of lookup calls is more than 10 to 20 percent of the lookup table rows. For a smaller number of lookup calls, do not cache if the number of lookup table rows is large. For small lookup tables (i.e., less than 5,000 rows), cache for more than 5 to 10 lookup calls.
5. Replace lookups with decode or IIF (for small sets of values).
6. If caching lookups and performance is poor, consider replacing with an unconnected, uncached lookup.
7. For overly large lookup tables, use dynamic caching along with a persistent cache. Cache the entire table to a persistent file on the first run, enable the "update else insert" option on the dynamic cache, and the engine never has to go back to the database to read data from this table. You can also partition this persistent cache at run time for further performance gains.
8. When handling multiple matches, use the "Return any matching value" setting whenever possible. Also use this setting if the lookup is being performed to determine that a match exists but the value returned is irrelevant. The lookup creates an index based on the key ports rather than all lookup transformation ports. This simplified indexing process can improve performance.
9. Review complex expressions.
   - Examine mappings via Repository Reporting and Dependency Reporting within the mapping.
   - Minimize aggregate function calls.
   - Replace an Aggregator Transformation with an Expression Transformation and an Update Strategy Transformation for certain types of aggregations.
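The byte-adjusted caching rule of thumb above can be read as a roughly constant byte budget of 500,000 rows times 1,024 bytes. A hypothetical helper encoding that reading (the document gives only the 1,024-byte baseline and the 2,048-byte example, so treat the exact numbers as an assumption, not a product formula):

```python
CACHE_BYTE_BUDGET = 500_000 * 1024  # ~500K rows at the 1,024-byte baseline

def max_cache_rows(row_bytes):
    """Approximate row-count ceiling below which caching a lookup table is
    worth considering, scaled down as the row width grows past 1,024 bytes."""
    if row_bytes <= 1024:
        return 500_000
    return CACHE_BYTE_BUDGET // row_bytes

print(max_cache_rows(1024), max_cache_rows(2048))  # 500000 250000
```

The 2,048-byte case lands at 250,000 rows, the lower end of the 250K to 300K range quoted in the rule of thumb; as always, the text advises benchmarking a session cached and uncached rather than relying on the estimate alone.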

Operations and Expression Optimizing Tips


1. Numeric operations are faster than string operations.
2. Optimize char-varchar comparisons (i.e., trim spaces before comparing).

3. Operators are faster than functions (i.e., || vs. CONCAT).
4. Optimize IIF expressions.
5. Avoid date comparisons in lookups; replace them with string comparisons.
6. Test expression timing by replacing the expression with a constant.
7. Use flat files.
   - Flat files located on the server machine load faster than a database located on the server machine.
   - Fixed-width files are faster to load than delimited files because delimited files require extra parsing.
   - If processing intricate transformations, consider loading the source flat file first into a relational database, which allows the PowerCenter mappings to access the data in an optimized fashion by using filters and custom SQL SELECTs where appropriate.

8. If working with data that is not able to return sorted data (e.g., web logs), consider using the Sorter Advanced External Procedure.
9. Use a Router Transformation to separate data flows instead of multiple Filter Transformations.
10. Use a Sorter Transformation or hash auto-keys partitioning before an Aggregator Transformation to optimize the aggregate. With a Sorter Transformation, the Sorted Ports option can be used even if the original source cannot be ordered.
11. Use a Normalizer Transformation to pivot rows rather than multiple instances of the same target.
12. Rejected rows from an update strategy are logged to the bad file. Consider filtering before the update strategy if retaining these rows is not critical, because logging causes extra overhead on the engine. Choose the option in the update strategy to discard rejected rows.
13. When using a Joiner Transformation, be sure to make the source with the smallest amount of data the Master source.
14. If an update override is necessary in a load, consider using a Lookup transformation just in front of the target to retrieve the primary key. The primary key update is much faster than the non-indexed lookup override.

Suggestions for Using Mapplets


A mapplet is a reusable object that represents a set of transformations. It allows you to reuse transformation logic and can contain as many transformations as necessary. Use the Mapplet Designer to create mapplets.


Mapping SDK

Challenge


Understand how to create PowerCenter repository objects such as mappings, sessions and workflows using the Java programming language instead of the PowerCenter client tools.

Description
PowerCenter's Mapping Software Developer Kit (SDK) is a set of interfaces that can be used to generate PowerCenter XML documents containing mappings, sessions and workflows. The Mapping SDK is a Java API that provides all of the elements needed to set up mappings in the repository where metadata is stored. These elements are the objects usually used in the PowerCenter Designer and Workflow Manager, such as source and target definitions, transformations, mapplets, mappings, tasks, sessions and workflows. The Mapping SDK can be found in the PowerCenter client installation. In the Mapping SDK directory, the following components are available:
- The javadoc (api directory), describing all the classes of the Java API
- The API (lib directory), which contains the jar files used for Mapping SDK applications
- Some basic samples, which show how Java development with the Mapping SDK can be done

Below is a simplified Class diagram that represents the Mapping SDK:


The purpose of the Mapping SDK is to improve design and development efficiency for repetitive tasks during implementation. The Mapping SDK can also be used for mapping auto-generation, completing the data flow for repetitive tasks with various structures of data. It can be used to create, on demand, mappings with the same transformations between various sources and targets.

A particular advantage for a project that has been designed using mapping auto-generation comes with project maintenance: the project team will be able to regenerate mappings quickly using the new source or target structure definitions.

The sections below are an example of a Mapping SDK implementation for mapping auto-generation purposes. Mapping auto-generation is based on a low-level Java API, which means that there are many ways to create mappings. The development of such a tool requires knowledge and skills in PowerCenter object design as well as Java program development. To implement the mapping auto-generation method, the project team should follow these tasks:
- Identify repetitive data mappings which will be common for task and methodology. Create samples of these mappings.
- Define where data structures are stored (e.g., database catalog, file, COBOL copybook).
- Develop a Java application using the Mapping SDK which is able to obtain the data structure of the project and to generate the mapping defined.

Identify Repetitive Data Mappings



In most projects there are some tasks or mappings that are similar and vary only in the structure of the data they transform. Examples of these types of mappings include:
- Loading a table from a flat file
- Performing incremental loads on historical and non-historical tables
- Extracting table data to files

During the design phase of the project, the Business Analyst and the Data Integration developer need to identify which tasks or mappings can be designed as repetitive tasks to improve the future design for similar tasks.

Create A Sample Mapping


During the design phase, the Data Integration developer must develop a sample mapping for each repetitive task that has been identified. This will help to outline how the data mapping could be designed; for example, it defines the transformations, mappings, tasks and processes needed to create the data mapping. A mapping template can be used for this purpose. Frequently, the repetitive tasks correspond to one of the sample data mappings that have been defined as mapping templates in Informatica's Customer Portal.

Define The Location Where Data Structures are Stored


An important point for the mapping auto-generation method is to define where the data structure needed to create the final mapping between the source and target structures can be found. You can build a Java application that will build a PowerCenter mapping with dynamic source and target definitions stored in:

- A set of data files
- A database catalog
- A structured file such as a COBOL copybook or XML Schema file

The final application may contain a set of functionalities to map the source and the target structure definitions.
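For instance, when the structures live in a flat file, a small loader can turn each definition into field metadata for the generator to consume. The semicolon-delimited layout used below (table;column;datatype;precision) is purely an assumption for illustration; a real project might instead read a database catalog or parse a copybook.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: a minimal reader for column definitions kept in a
// delimited text format (table;column;datatype;precision). The layout is an
// assumption of this sketch, not a format mandated by the Mapping SDK.
public class StructureReader {

    public static final class FieldDef {
        public final String table, column, datatype;
        public final int precision;
        public FieldDef(String table, String column, String datatype, int precision) {
            this.table = table;
            this.column = column;
            this.datatype = datatype;
            this.precision = precision;
        }
    }

    // Parses one definition per line, skipping blank lines and '#' comments.
    public static List<FieldDef> parse(List<String> lines) {
        List<FieldDef> defs = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) continue;
            String[] parts = trimmed.split(";");
            defs.add(new FieldDef(parts[0], parts[1], parts[2],
                                  Integer.parseInt(parts[3])));
        }
        return defs;
    }

    public static void main(String[] args) {
        List<FieldDef> defs = parse(Arrays.asList(
            "# table;column;datatype;precision",
            "CUSTOMER;CUST_ID;NUMBER;10",
            "CUSTOMER;CUST_NAME;VARCHAR2;80"));
        for (FieldDef d : defs) {
            System.out.println(d.table + "." + d.column
                + " " + d.datatype + "(" + d.precision + ")");
        }
    }
}
```

Whatever the physical storage, the goal is the same: normalize each source into a common in-memory field model that the generation step can consume without caring where the definitions came from.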

Develop A Java Application Using The Mapping SDK


As a final step during the build phase, develop a Java application that will create (according to the source and target structure definition) the final mapping definition that includes all of the column specifications for the source and target. This application will be based on the Mapping SDK, which provides all of the resources to create an XML file containing the mapping, session and workflow definition. This application has to be developed in such a way as to generate all of the types of mappings that were defined during the design phase.
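As a rough sketch of the generation step, the fragment below emits a skeletal mapping document using only the JDK's DOM and Transformer APIs. The element subset (POWERMART, REPOSITORY, FOLDER, MAPPING) merely echoes the shape of a PowerCenter export; a real import file must conform to the full powrmart.dtd, which the Mapping SDK produces for you, so treat this strictly as an illustration of programmatic XML generation.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Illustration only: builds a skeletal, PowerCenter-flavoured mapping document
// with the JDK's DOM APIs. The element subset here is an assumption; the real
// Mapping SDK generates a complete export file conforming to powrmart.dtd.
public class MappingXmlSketch {

    public static String generate(String folderName, String mappingName) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element root = doc.createElement("POWERMART");
        doc.appendChild(root);
        Element repo = doc.createElement("REPOSITORY");
        root.appendChild(repo);
        Element folder = doc.createElement("FOLDER");
        folder.setAttribute("NAME", folderName);
        repo.appendChild(folder);
        Element mapping = doc.createElement("MAPPING");
        mapping.setAttribute("NAME", mappingName); // e.g. m_SALES_ORA_DIM_CUSTOMER
        folder.appendChild(mapping);

        // Serialize the DOM tree to a string for writing to the import file.
        StringWriter out = new StringWriter();
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(generate("DW_SALES_US", "m_SALES_ORA_DIM_CUSTOMER"));
    }
}
```

In practice the loop over the parsed structure definitions would add the source, target, and transformation elements for every column, which is exactly the repetitive work the Mapping SDK's object model takes off your hands.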


Last updated: 29-May-08 13:18


Mapping Templates

Challenge


Mapping Templates demonstrate proven solutions for tackling challenges that commonly occur during data integration development efforts. Mapping Templates can be used to make the development phase of a project more efficient. They can also serve as a medium for introducing development standards, which developers need to follow, into the mapping development process.

A wide array of Mapping Template examples can be obtained for the most current PowerCenter version from the Informatica Customer Portal. As "templates," each of the objects in Informatica's Mapping Template Inventory illustrates the transformation logic and steps required to solve specific data integration requirements. These sample templates, however, are meant to be used as examples, not as a means to implement development standards.

Description
Reuse Transformation Logic
Templates can be used heavily in a data integration and warehouse environment when loading information from multiple source providers into the same target structure, or when similar source system structures are employed to load different target instances. Using templates guarantees that any transformation logic that is developed and tested correctly once can be successfully applied across multiple mappings as needed. In some instances, if the source/target structures have the same attributes, the process can be further simplified by creating multiple instances of the session, each with its own connection/execution attributes, instead of duplicating the mapping.

Implementing Development Techniques


When the process is not simply duplicating transformation logic to load the same target, Mapping Templates can still help to reproduce transformation techniques. In this case, the implementation process requires more than just replacing source/target transformations. This scenario is most useful when certain logic (i.e., a logical group of transformations) is employed across mappings. In many instances this can be further simplified by making use of mapplets. Additionally, user-defined functions can be utilized to reuse expression logic and to build complex expressions using the transformation language.

Transport mechanism
Once Mapping Templates have been developed, they can be distributed by any of the following procedures:
- Copy the mapping from the development area to the desired repository/folder.
- Export the mapping template into XML and import it to the desired repository/folder.

Mapping template examples


The following Mapping Templates can be downloaded from the Informatica Customer Portal and are listed by subject area:

Common Data Warehousing Techniques


- Aggregation using Sorted Input
- Tracking Dimension History
- Constraint-Based Loading
- Loading Incremental Updates
- Tracking History and Current
- Inserts or Updates

Transformation Techniques
- Error Handling Strategy
- Flat File Creation with Headers and Footers
- Removing Duplicate Source Records
- Transforming One Record into Multiple Records
- Dynamic Caching
- Sequence Generator Alternative
- Streamline a Mapping with a Mapplet
- Reusable Transformations (Customers)
- Using a Sorter


- Pipeline Partitioning Mapping Template
- Using Update Strategy to Delete Rows
- Loading Heterogeneous Targets
- Load Using External Procedure

Advanced Mapping Concepts


- Aggregation Using Expression Transformation
- Building a Parameter File
- Best Build Logic
- Comparing Values Between Records
- Transaction Control Transformation

Source-Specific Requirements
- Processing VSAM Source Files
- Processing Data from an XML Source
- Joining a Flat File with a Relational Table

Industry-Specific Requirements
- Loading SWIFT 942 Messages
- Loading SWIFT 950 Messages

Last updated: 01-Feb-07 18:53


Naming Conventions

Challenge


A variety of factors are considered when assessing the success of a project. Naming standards are an important, but often overlooked, component. The application and enforcement of naming standards not only establishes consistency in the repository, but also provides a developer-friendly environment. Choose a good naming standard and adhere to it to ensure that the repository can be easily understood by all developers.

Description
Although naming conventions are important for all repository and database objects, the suggestions in this Best Practice focus on the former. Choosing a convention and sticking with it is the key. Having a good naming convention facilitates smooth migrations and improves readability for anyone reviewing or carrying out maintenance on the repository objects. It helps them to understand the processes being affected. If consistent names and descriptions are not used, significant time may be needed to understand the workings of mappings and transformation objects. If no description is provided, a developer is likely to spend considerable time going through an object or mapping to understand its objective. The following pages offer suggested naming conventions for various repository objects. Whatever convention is chosen, it is important to make the selection very early in the development cycle and communicate the convention to project staff working on the repository. The policy can be enforced by peer review and at test phases by adding processes to check conventions both to test plans and to test execution documents.

Suggested Naming Conventions


Designer Objects: Suggested Naming Conventions

- Mapping: m_{PROCESS}_{SOURCE_SYSTEM}_{TARGET_NAME}, or suffix with _{descriptor} if there are multiple mappings for that single target table.
- Mapplet: mplt_{DESCRIPTION}
- Target: {update_type(s)}_{TARGET_NAME}. This naming convention should only occur within a mapping, as the actual target name object affects the actual table that PowerCenter will access.
- Aggregator Transformation: AGG_{FUNCTION} that leverages the expression and/or a name that describes the processing being done.
- Application Source Qualifier Transformation: ASQ_{TRANSFORMATION}_{SOURCE_TABLE1}_{SOURCE_TABLE2}; represents data from the application source.
- Custom Transformation: CT_{TRANSFORMATION} name that describes the processing being done.
- Data Quality Transform: IDQ_{descriptor}_{plan}, with the descriptor describing what this plan is doing and the optional plan name included if desired.
- Expression Transformation: EXP_{FUNCTION} that leverages the expression and/or a name that describes the processing being done.
- External Procedure Transformation: EXT_{PROCEDURE_NAME}
- Filter Transformation: FIL_ or FILT_{FUNCTION} that leverages the expression or a name that describes the processing being done.
- Flexible Target Key: Fkey{descriptor}
- HTTP: http_{descriptor}
- Idoc Interpreter: idoci_{Descriptor}_{IDOC Type}, defining what the idoc does and possibly the idoc message.
- Idoc Prepare: idocp_{Descriptor}_{IDOC Type}, defining what the idoc does and possibly the idoc message.
- Java Transformation: JV_{FUNCTION} that leverages the expression or a name that describes the processing being done.
- Joiner Transformation: JNR_{DESCRIPTION}
- Lookup Transformation: LKP_{TABLE_NAME}, or suffix with _{descriptor} if there are multiple lookups on a single table. For unconnected lookups, use ULKP in place of LKP.
- Mapplet Input Transformation: MPLTI_{DESCRIPTOR}, indicating the data going into the mapplet.
- Mapplet Output Transformation: MPLTO_{DESCRIPTOR}, indicating the data coming out of the mapplet.
- MQ Source Qualifier Transformation: MQSQ_{DESCRIPTOR}, defining the messaging being selected.
- Normalizer Transformation: NRM_{FUNCTION} that leverages the expression or a name that describes the processing being done.
- Rank Transformation: RNK_{FUNCTION} that leverages the expression or a name that describes the processing being done.
- Router Transformation: RTR_{DESCRIPTOR}
- SAP DMI Prepare: dmi_{Entity Descriptor}_{Secondary Descriptor}, defining what entity is being loaded and a secondary description if multiple DMI objects are being leveraged in a mapping.
- Sequence Generator Transformation: SEQ_{DESCRIPTOR}; if using keys for a target table entity, then refer to that entity.
- Sorter Transformation: SRT_{DESCRIPTOR}
- Source Qualifier Transformation: SQ_{SOURCE_TABLE1}_{SOURCE_TABLE2}. Using all source tables can be impractical if there are a lot of tables in a source qualifier, so refer to the type of information being obtained, for example a certain type of product: SQ_SALES_INSURANCE_PRODUCTS.
- Stored Procedure Transformation: SP_{STORED_PROCEDURE_NAME}
- Transaction Control Transformation: TCT_ or TRANS_{DESCRIPTOR}, indicating the function of the transaction control.
- Union Transformation: UN_{DESCRIPTOR}
- Unstructured Data Transform: UDO_{descriptor}, with the descriptor identifying the kind of data being parsed by the UDO transform.
- Update Strategy Transformation: UPD_{UPDATE_TYPE(S)}, or UPD_{UPDATE_TYPE(S)}_{TARGET_NAME} if there are multiple targets in the mapping. E.g., UPD_UPDATE_EXISTING_EMPLOYEES.
- Web Service Consumer: WSC_{descriptor}
- XML Generator Transformation: XMG_{DESCRIPTOR}, defining the target message.
- XML Parser Transformation: XMP_{DESCRIPTOR}, defining the messaging being selected.
- XML Source Qualifier Transformation: XMSQ_{DESCRIPTOR}, defining the data being selected.
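Conventions such as these lend themselves to automated checking during peer review or test phases. The sketch below validates names for a small subset of object types; the regular expressions are assumptions that encode only the prefix portion of the standard and would need to be extended to cover a project's full rule set.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Sketch of a naming-convention checker covering a subset of the prefixes in
// this document (mappings, expressions, lookups, sessions). The patterns are
// illustrative assumptions; extend the table for your project's full standard.
public class NamingChecker {

    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("mapping",    Pattern.compile("m_[A-Z0-9_]+"));
        RULES.put("expression", Pattern.compile("EXP_[A-Z0-9_]+"));
        RULES.put("lookup",     Pattern.compile("U?LKP_[A-Z0-9_]+"));
        // Sessions are named s_{MappingName}, and mappings start with m_.
        RULES.put("session",    Pattern.compile("s_m_[A-Z0-9_]+"));
    }

    public static boolean isValid(String objectType, String name) {
        Pattern p = RULES.get(objectType);
        if (p == null) {
            throw new IllegalArgumentException("No rule for " + objectType);
        }
        return p.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("mapping", "m_SALES_ORA_DIM_CUSTOMER")); // true
        System.out.println(isValid("mapping", "CustomerLoad"));             // false
        System.out.println(isValid("lookup",  "ULKP_DIM_PRODUCT"));         // true
    }
}
```

Run against a repository export, a check like this turns the naming policy into something a test plan can execute rather than something reviewers must eyeball.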

Port Names
Port names should remain the same as the source unless some other action is performed on the port. In that case, the port should be prefixed with the appropriate name. When the developer brings a source port into a lookup, the port should be prefixed with in_. This helps the user immediately identify the ports that are being input without having to line up the ports with the input checkbox. In any other transformation, if the input port is transformed into an output port with the same name, prefix the input port with in_.

Generated output ports can also be prefixed. This helps trace the port value throughout the mapping as it may travel through many other transformations. If you intend to use the autolink feature based on names, outputs may be better left with the name of the target port in the next transformation. For variables inside a transformation, the developer can use the prefix v_ or var_ plus a meaningful name.

With some exceptions, port standards apply when creating a transformation object. The exceptions are the Source Definition, the Source Qualifier, the Lookup, and the Target Definition ports, which must not change since the port names are used to retrieve data from the database. Other transformations that are not applicable to the port standards are:
- Normalizer - The ports created in the Normalizer are automatically formatted when the developer configures it.
- Sequence Generator - The ports are reserved words.
- Router - Because output ports are created automatically, prefixing the input ports with I_ prefixes the output ports with I_ as well. Port names should not have any prefix.
- Sorter, Update Strategy, Transaction Control, and Filter - These ports are always input and output. There is no need to rename them unless they are prefixed. Prefixed port names should be removed.
- Union - The group ports are automatically assigned to the input and output; therefore prefixing with anything is reflected in both the input and output. The port names should not have any prefix.

All other transformation object ports can be prefixed or suffixed with:


- in_ or i_ for Input ports
- o_ or _out for Output ports
- io_ for Input/Output ports
- v_ or var_ for variable ports
- lkp_ for returns from lookups
- mplt_ for returns from mapplets

Prefixes are preferable because they are generally easier to see; developers do not need to expand the columns to see the suffix for longer port names. Transformation object ports can also:
- Have the Source Qualifier port name.
- Be unique.
- Be meaningful.
- Be given the target port name.
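The prefix portion of these rules can be captured in a small helper; the Role names below are an invention of this example, and only one of each pair of alternative prefixes (e.g., in_ rather than i_) is applied.

```java
// Sketch: derives a conventionally prefixed port name from a base name and a
// port role, following the prefix list in this document (in_, o_, io_, v_,
// lkp_, mplt_). The Role enum itself is an assumption of this example.
public class PortNamer {

    public enum Role { INPUT, OUTPUT, INPUT_OUTPUT, VARIABLE, LOOKUP_RETURN, MAPPLET_RETURN }

    public static String name(Role role, String baseName) {
        switch (role) {
            case INPUT:          return "in_"   + baseName;
            case OUTPUT:         return "o_"    + baseName;
            case INPUT_OUTPUT:   return "io_"   + baseName;
            case VARIABLE:       return "v_"    + baseName;
            case LOOKUP_RETURN:  return "lkp_"  + baseName;
            case MAPPLET_RETURN: return "mplt_" + baseName;
            default: throw new IllegalStateException("Unhandled role: " + role);
        }
    }

    public static void main(String[] args) {
        System.out.println(name(Role.INPUT, "CUST_ID"));      // in_CUST_ID
        System.out.println(name(Role.VARIABLE, "SALES_TAX")); // v_SALES_TAX
    }
}
```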

Transformation Descriptions
This section defines the standards to be used for transformation descriptions in the Designer.

- Source Qualifier Descriptions. Should include the aim of the source qualifier and the data it is intended to select. Should also indicate if any overrides are used; if so, describe the filters or settings used. Some projects prefer items such as the SQL statement to be included in the description as well.
- Lookup Transformation Descriptions. Describe the lookup along the lines of "the [lookup attribute] obtained from [lookup table name] to retrieve the [lookup attribute name]", where:
  - Lookup attribute is the name of the column being passed into the lookup and is used as the lookup criteria.
  - Lookup table name is the table on which the lookup is being performed.
  - Lookup attribute name is the name of the attribute being returned from the lookup. If appropriate, specify the condition when the lookup is actually executed.
  It is also important to note lookup features such as persistent cache or dynamic lookup.
- Expression Transformation Descriptions. Must adhere to the following format: "This expression [explanation of what the transformation does]." Expressions can be distinctly different depending on the situation; therefore the explanation should be specific to the actions being performed. Within each Expression, transformation ports have their own description in the format: "This port [explanation of what the port is used for]."
- Aggregator Transformation Descriptions. Must adhere to the following format: "This Aggregator [explanation of what the transformation does]." Aggregators can be distinctly different depending on the situation; therefore the explanation should be specific to the actions being performed. Within each Aggregator, transformation ports have their own description in the format: "This port [explanation of what the port is used for]."
- Sequence Generator Transformation Descriptions. Must adhere to the following format: "This Sequence Generator provides the next value for the [column name] on the [table name]", where:
  - Table name is the table being populated by the sequence number, and
  - Column name is the column within that table being populated.
- Joiner Transformation Descriptions. Must adhere to the following format: "This Joiner uses [joining field names] from [joining table names]", where:
  - Joining field names are the names of the columns on which the join is done, and
  - Joining table names are the tables being joined.
- Normalizer Transformation Descriptions. Must adhere to the following format: "This Normalizer [explanation]", where the explanation describes what the Normalizer does.
- Filter Transformation Descriptions. Must adhere to the following format: "This Filter processes [explanation]", where the explanation describes what the filter criteria are and what they do.
- Stored Procedure Transformation Descriptions. Explain the stored procedure's functionality within the mapping (i.e., what does it return in relation to the input ports?).
- Mapplet Input Transformation Descriptions. Describe the input values and their intended use in the mapplet.
- Mapplet Output Transformation Descriptions. Describe the output ports and the subsequent use of those values. As an example, for an exchange rate mapplet, describe what currency the output value will be in. Answer questions like: is the currency fixed or based on other data? What kind of rate is used: a fixed inter-company rate, an inter-bank rate, a business rate or a tourist rate? Has the conversion gone through an intermediate currency?
- Update Strategy Transformation Descriptions. Describe the Update Strategy and whether it is fixed in its function or determined by a calculation.
- Sorter Transformation Descriptions. Explain the port(s) that are being sorted and their sort direction.
- Router Transformation Descriptions. Describe the groups and their functions.
- Union Transformation Descriptions. Describe the source inputs and indicate what further processing on those inputs (if any) is expected to take place in later transformations in the mapping.
- Transaction Control Transformation Descriptions. Describe the process behind the transaction control and the function of the control to commit or roll back.
- Custom Transformation Descriptions. Describe the function that the custom transformation accomplishes, what data is expected as input and what data will be generated as output. Also indicate the module name (and location) and the procedure which is used.
- External Procedure Transformation Descriptions. Describe the function of the external procedure, what data is expected as input and what data will be generated as output. Also indicate the module name (and location) and the procedure that is used.
- Java Transformation Descriptions. Describe the function of the Java code, what data is expected as input and what data is generated as output. Also indicate whether the Java code determines the object to be an Active or Passive transformation.
- Rank Transformation Descriptions. Indicate the columns being used in the rank, the number of records returned from the rank, the rank direction, and the purpose of the transformation.
- XML Generator Transformation Descriptions. Describe the data expected for the generation of the XML and indicate the purpose of the XML being generated.
- XML Parser Transformation Descriptions. Describe the input XML expected and the output from the parser, and indicate the purpose of the transformation.

Mapping Comments
These comments describe the source data obtained and the structure file, table or facts and dimensions that it populates. Remember to use business terms along with such technical details as table names. This is beneficial when maintenance is required or if issues arise that need to be discussed with business analysts.

Mapplet Comments
These comments are used to explain the process that the mapplet carries out. Always be sure to see the notes regarding descriptions for the input and output transformation.

Repository Objects
Repositories, as well as repository-level objects, should have meaningful names. Repositories should be prefixed with either L_ for local or G for global, followed by a descriptor. Descriptors usually include information about the project and/or level of the environment (e.g., PROD, TEST, DEV).

Folders and Groups


Working folder names should be meaningful and include the project name and, if there are multiple folders for that one project, a descriptor. User groups should also include the project name and descriptors, as necessary. For example, folders DW_SALES_US and DW_SALES_UK could both have TEAM_SALES as their user group. Individual developer folders or non-production folders should be prefixed with z_ so that they are grouped together and not confused with working production folders.

Shared Objects and Folders


Any object within a folder can be shared across folders and maintained in one central location. These objects are sources, targets, mappings, transformations, and mapplets. To share objects in a folder, the folder must be designated as shared. In addition to facilitating maintenance, shared folders help reduce the size of the repository, since shortcuts link to the original instead of copies.

Only users with the proper permissions can access these shared folders. These users are responsible for migrating the folders across the repositories and, with help from the developers, for maintaining the objects within the folders. For example, if an object is created by a developer and is to be shared, the developer should provide details of the object and the level at which the object is to be shared before the Administrator accepts it as a valid entry into the shared folder. The developers, not necessarily the creator, control the maintenance of the object, since they must ensure that a subsequent change does not negatively impact other objects.

If the developer has an object to use in several mappings or across multiple folders, like an Expression transformation that calculates sales tax, the developer can place the object in a shared folder, then use the object in other folders by creating a shortcut to it. In this case, the naming convention is sc_ (e.g., sc_EXP_CALC_SALES_TAX). The folder should be prefixed with SC_ to identify it as a shared folder and keep all shared folders grouped together in the repository.

Workflow Manager Objects


Workflow Objects: Suggested Naming Conventions

- Session: s_{MappingName}
- Command Object: cmd_{DESCRIPTOR}
- Worklet: wk or wklt_{DESCRIPTOR}
- Workflow: wkf or wf_{DESCRIPTOR}
- Email Task: email_ or eml_{DESCRIPTOR}
- Decision Task: dcn_ or dt_{DESCRIPTOR}
- Assign Task: asgn_{DESCRIPTOR}
- Timer Task: timer_ or tmr_{DESCRIPTOR}
- Control Task: ctl_{DESCRIPTOR}. Specify when and how the PowerCenter Server is to stop or abort a workflow by using the Control task in the workflow.
- Event Wait Task: wait_ or ew_{DESCRIPTOR}. Waits for an event to occur; once the event triggers, the PowerCenter Server continues executing the rest of the workflow.
- Event Raise Task: raise_ or er_{DESCRIPTOR}. Represents a user-defined event. When the PowerCenter Server runs the Event-Raise task, the Event-Raise task triggers the event. Use the Event-Raise task with the Event-Wait task to define events.

ODBC Data Source Names


All Open Database Connectivity (ODBC) data source names (DSNs) should be set up in the same way on all client machines. PowerCenter uniquely identifies a source by its Database Data Source (DBDS) and its name. The DBDS is the same name as the ODBC DSN, since the PowerCenter Client talks to all databases through ODBC. Also be sure to set up the ODBC DSNs as system DSNs so that all users of a machine can see the DSN. This approach reduces the chance of a discrepancy occurring when users work on different (i.e., colleagues') machines and have to recreate a DSN on each one.

If ODBC DSNs are different across multiple machines, there is a risk of analyzing the same table using different names. For example, machine 1 has ODBC DSN Name0 that points to database1. TableA gets analyzed on machine 1 and is uniquely identified as Name0.TableA in the repository. Machine 2 has ODBC DSN Name1 that points to database1. TableA gets analyzed on machine 2 and is uniquely identified as Name1.TableA in the repository. The result is that the repository may refer to the same object by multiple names, creating confusion for developers, testers, and potentially end users.

Also, refrain from using environment tokens in the ODBC DSN. For example, do not call it dev_db01. When migrating objects from dev, to test, to prod, PowerCenter can wind up with source objects called dev_db01 in the production repository. ODBC database names should clearly describe the database they reference to ensure that users do not incorrectly point sessions to the wrong databases.

Database Connection Information


Security considerations may dictate using the company name of the database or project instead of {user}_{database name}, except for developer scratch schemas, which are not found in test or production environments. Be careful not to include machine names or environment tokens in the database connection name. Database connection names must be very generic to be understandable and ensure a smooth migration. The naming convention should be applied across all development, test, and production environments. This allows seamless migration of sessions when migrating between environments.

If an administrator uses the Copy Folder function for migration, session information is also copied. If the Database Connection information does not already exist in the folder the administrator is copying to, it is also copied. So, if the developer uses connections with names like Dev_DW in the development repository, they are likely to eventually wind up in the test, and even the production repositories as the folders are migrated. Manual intervention is then necessary to change connection names, user names, passwords, and possibly even connect strings.

Instead, if the developer just has a DW connection in each of the three environments, when the administrator copies a folder from the development environment to the test environment, the sessions automatically use the existing connection in the test repository. With the right naming convention, you can migrate sessions from the test to production repository without manual intervention.

TIP: At the beginning of a project, have the Repository Administrator or DBA set up all connections in all environments based on the issues discussed in this Best Practice. Then use permission options to protect these connections so that only specified individuals can modify them. Whenever possible, avoid having developers create their own connections using different conventions and possibly duplicating connections.

Administration Console Objects


Administration console objects such as domains, nodes, and services should also have meaningful names.

- Domain: DOM_ or DMN_[PROJECT]_[ENVIRONMENT] (e.g., DOM_PROCURE_DEV)
- Node: NODE[#]_[SERVER_NAME]_[optional_descriptor] (e.g., NODE02_SERVER_rs_b, a backup node for the repository service)
- Integration Service: INT_SVC_[ENVIRONMENT]_[optional descriptor] (e.g., INT_SVC_DEV_primary)
- Repository Service: REPO_SVC_[ENVIRONMENT]_[optional descriptor] (e.g., REPO_SVC_TEST)
- Web Services Hub: WEB_SVC_[ENVIRONMENT]_[optional descriptor] (e.g., WEB_SVC_PROD)

PowerCenter PowerExchange Application/Relational Connections


Before the PowerCenter Server can access a source or target in a session, you must configure connections in the Workflow Manager. When you create or modify a session that reads from, or writes to, a database, you can select only configured source and target databases. Connections are saved in the repository.

For the PowerExchange Client for PowerCenter, you configure relational database and/or application connections. The connection you configure depends on the type of source data you want to extract and the extraction mode (e.g., PWX[MODE_INITIAL]_[SOURCE]_[Instance_Name]). The following examples show, for each source type and extraction mode, the connection category, the connection type, and the recommended naming convention:

- DB2/390 Bulk Mode: Relational connection, type PWX DB2390; PWXB_DB2_Instance_Name
- DB2/390 Change Mode: Application connection, type PWX DB2390 CDC Change; PWXC_DB2_Instance_Name
- DB2/390 Real Time Mode: Application connection, type PWX DB2390 CDC Real Time; PWXR_DB2_Instance_Name
- IMS Batch Mode: Application connection, type PWX NRDB Batch; PWXB_IMS_Instance_Name
- IMS Change Mode: Application connection, type PWX NRDB CDC Change; PWXC_IMS_Instance_Name
- IMS Real Time Mode: Application connection, type PWX NRDB CDC Real Time; PWXR_IMS_Instance_Name
- Oracle Change Mode: Application connection, type PWX Oracle CDC Change; PWXC_ORA_Instance_Name
- Oracle Real Time Mode: Application connection, type PWX Oracle CDC Real Time; PWXR_ORA_Instance_Name

PowerCenter PowerExchange Target Connections


The connection you configure depends on the type of target data you want to load.

- DB2/390: PWX DB2390 relational database connection; PWXT_DB2_Instance_Name
- DB2/400: PWX DB2400 relational database connection; PWXT_DB2_Instance_Name

Last updated: 05-Dec-07 16:20


Naming Conventions - B2B Data Transformation

Challenge


As with any development process, the use of clear, consistent, and documented naming conventions contributes to the effective use of Informatica B2B Data Transformation. The purpose of this document is to provide suggested naming conventions for the major structural elements of B2B Data Transformation solutions.

Description
The process of creating a B2B Data Transformation solution consists of several logical phases, each of which has implications for naming conventions. Some of these naming conventions are based upon best practices discovered during the creation of B2B Data Transformation solutions; others are restrictions imposed on the naming of solution artifacts due to both the use of the underlying file system and the need to make solutions callable from a wide variety of host runtime and development environments. The main phases involved in the construction of a B2B Data Transformation solution are:
1. The creation of one or more transformation projects using the B2B Data Transformation Studio (formerly known as ContentMaster Studio) authoring environment. A typical solution may involve the creation of many transformation projects.
2. The publication of the transformation projects as transformation services.
3. The deployment of the transformation services.
4. The creation/configuration of the host integration environment to invoke the published transformation services.
Each of these phases has implications for the naming of transformation solution components and artifacts (i.e., projects, TGP scripts, schemas, published services). Several common patterns occur in B2B Data Transformation solutions that have implications for naming:
- Many components are realized physically as file system objects such as files and directories. For maximum compatibility and portability, it is desirable to name these objects so that they can be transferred between Windows, UNIX and other platforms without having to rename them to conform to different file system conventions.
- Inputs and outputs to and from B2B Data Transformation services are often files or entities designated by URLs. Again, restrictions of underlying file systems play an important role here.
- B2B Data Transformation solutions are designed to be embeddable, that is, callable from a host application or environment through the use of scripts, programming language APIs provided for languages such as C, C# and Java, and through the use of agents for PowerCenter and other platforms. Hence some of the naming conventions are based on maximizing usability of transformation services from within various host environments or APIs.
- Within B2B Data Transformation projects, most names and artifacts are global: the scope of names is global to the project.
B2B Data Transformation Studio Designer


B2B Data Transformation Studio is the user interface for the development of B2B Data Transformation solutions. It is based on the open source Eclipse environment and inherits many of its characteristics regarding project naming and structure. The workspace is organized as a set of sub-directories, with one sub-directory representing each project. A specially designated directory named .metadata is used to hold metadata about the current workspace. For more information about Studio Designer and the workspace, refer to Establishing a B2B Data Transformation Development Architecture.


At any common level of visibility, B2B Data Transformation requires that all elements have distinct names. Thus no two projects within a repository or workspace may share the same name. Likewise, no two TGP script files, XML schemas, global parsers, mappers, serializers or variable definitions may share the same name.

Within a transformation (such as a parser, mapper or serializer), groupings, actions or subsections of a transformation may be assigned names. In this context, the name does not strictly identify the section but is used both as a developer convenience and as a way to identify the section in the event file. In this case, names are allowed to be duplicated, and often the name serves as a shorthand comment about the section. There are no restrictions on such names, although it is recommended that a name be unique, short and intuitively identify the section. Often the name may be used to refer to elements in the specification (such as "Map 835 ISA Segment"). Contrary to the convention for global names, spaces are often used for readability. To distinguish them from sub-element names that are only used within transformations, the names of entry points, scripts and variables that are used as service parameters and the like are referred to as public names.

B2B Data Transformation Studio Best Practices


As B2B Data Transformation Studio will load all projects in the current workspace into the studio environment, keeping all projects under design in a single workspace leads to both excessive memory usage and logical clutter between transformations belonging to different, possibly unrelated, solutions. Note: B2B Data Transformation Studio allows for the closing of projects to reduce memory consumption. While this aids with memory consumption, it does not address the logical organization benefits of using separate workspaces.

Use Separate Workspaces for Separate Solutions
For distinct logical solutions, it is recommended to use separate workspaces to organize projects relating to separate solutions. Refer to Establishing a B2B Data Transformation Development Architecture for more information.

Create Separate Transformation Projects for Each Distinct Service
From a logical organization perspective, it is easier to manage data transformation solutions if only one primary service is published from each project. Secondary services from the same project should be reserved for the publication of test or troubleshooting variations of the same primary service. The one exception should be where multiple services are substantially the same, with the same transformation code but with minor differences to inputs. One alternative to publishing multiple services from the same project is to publish a shared service which is then called by the other services in order to perform the common transformation routines. For ease of maintenance, it is often desirable to name the project after the primary service which it publishes. While these do not have to be the same, it is a useful convention and simplifies the management of projects.
Use Names Compatible with Command Line Argument Formats
When a transformation service is invoked at runtime, it may be invoked on the command line (via cm_console), via .NET or Java APIs, via integration agents that invoke a service from a hosting platform such as webMethods, BizTalk or IBM Process Server, or from PowerCenter via the UDO option for PowerCenter.

Use Names Compatible with Programming Language Function Names
While the programming APIs allow for the use of any string as the name, to simplify interoperability with future APIs and command line tools, the service name should be compatible with the rules for C# and Java variable names, and with argument names for Windows, Unix and other OS command lines.
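A minimal sketch of the check this recommendation implies: a name that begins with a letter and contains only letters, digits and underscores is safe as a C#/Java identifier and as a command line argument. The function name is illustrative.

```python
import re

# Names beginning with a letter and containing only letters, digits and
# underscores are safe as identifiers and command line arguments.
_IDENTIFIER = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")

def is_safe_service_name(name):
    return bool(_IDENTIFIER.match(name))
```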

Use Names Compatible with File System Naming on Unix and Windows
Due to the files produced behind the scenes, the published service name and project name need to be compatible with the naming conventions for file and directory names on their target platforms. To allow for optimal cross-platform migration in the future, names should be chosen so as to be compatible with file naming restrictions on Windows, Unix and other platforms.

Do Not Include Version or Date Information in Public Names
It is recommended that project names, published service names, names of publicly accessible transformations and other public names do not include version numbers or date-of-creation information. Due to the way in which B2B Data Transformation operates, the use of dates or version numbers would make it difficult to use common source code control systems to track changes to projects. Unless the version corresponds to a different version of a business problem (such as dealing with two different versions of an HL7 specification), it is recommended that names do not include version or date information.

Naming B2B Data Transformation Projects


When a project is created, the user is prompted for the project name.

Project names will be used by default as the published service name. Both the directory for the project within a workspace and the main cmw project file name will be based on the project name. Because it is recommended that the project name define the published service name, the project name should not conflict with the name of an existing service unless the project publishes that service.

Note: B2B Data Transformation disallows the use of $, ~, ^, *, ?, >, <, comma, `, \, /, ;, | in project names.

Project naming should be clear and consistent within both a repository and a workspace. The exact approach to naming will vary depending on an organization's needs.

Project Naming Best Practices


Project Names Must Be Unique Across the Workspaces in Which They Occur
Also, if project-generated services will be deployed onto separate production environments, the names of those services must be unique on those environments as well.

Do Not Name a Project after a Published Service, unless the Project Produces that Published Service
This requirement can be relaxed if service names distinct from project names are being used.

Do Not Name a Project .metadata
This will conflict with the underlying Eclipse metadata.

Do Not Include Version or Date Information in Project Names
While it may be appealing to use version or date indicators in project names, the ideal solution for version tracking of services is to use a source control system such as CVS, Visual SourceSafe, Source Depot or one of the many other commercially available or open-source source control systems.

Consider Including the Source Format in the Name
If transformations within a project will operate predominantly on one primary data source format, including the data source in the project name may be helpful. For example: TranslateHipaa837ToXml

Consider Including the Target Format in the Name
If transformations within a project will produce predominantly one target data format, including the data format in the project name may be helpful. For example: TranslateCobolCopybookToSwift

Use Short, Descriptive Project Names
Include enough descriptive information within the project name to indicate its function. Remember that the project name will also determine the default published service name. For ease of readability in B2B Data Transformation Studio, it is also recommended to keep project names to 80 characters or less. Consider also conforming to C identifier names (combinations of a-z, A-Z, 0-9, _), which should provide maximum

Keep Project Names Compatible with File and Directory Naming Restrictions on Unix, Windows and other Platforms
As project names determine file and directory names for a variety of solution artifacts, it is highly recommended that project names conform to file name restrictions across a variety of file systems. While it is possible to use invalid Unix file names as project names on Windows, and invalid Windows file names in Unix projects, it is recommended to avoid OS file system conflicts where possible to maximize future portability. More detailed file system restrictions are identified in the appendix. Briefly, these include:
- Do not use system file names such as CON, PRN, AUX, CLOCK$, NUL, COM0, COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, LPT0, LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, and LPT9
- Do not use reserved Eclipse names such as .metadata
- Do not use characters such as |\?*<":>+[]/ or control characters
- Optionally, exclude spaces and other whitespace characters from names

Use Project Names Compatible with Deployed Service Names
As it is recommended that, where possible, service names are the same as the project that produces them, names of projects should also follow the service naming recommendations for command line parameters and API identifiers.
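The restrictions above can be combined into a single portability check. This is a sketch under stated assumptions: the reserved device names, the reserved Eclipse name, and the invalid character set come from this document; the function name and structure are illustrative.

```python
import re

# Sketch of a cross-platform portability check: rejects reserved
# DOS/Windows device names, the reserved Eclipse .metadata name,
# characters invalid on Windows, and (optionally) whitespace.

RESERVED_NAMES = {"CON", "PRN", "AUX", "CLOCK$", "NUL"} \
    | {"COM%d" % i for i in range(10)} \
    | {"LPT%d" % i for i in range(10)}
INVALID_CHARS = re.compile(r'[|\\?*<":>+\[\]/\x00-\x1f]')

def is_portable_project_name(name, allow_spaces=False):
    if name.upper() in RESERVED_NAMES or name == ".metadata":
        return False
    if INVALID_CHARS.search(name):
        return False
    if not allow_spaces and any(ch.isspace() for ch in name):
        return False
    return len(name) > 0
```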

Naming Published Services


When a project is published, by default the service will have the same name as the project from which it was published.


Many of the restrictions for project names should be observed and, if possible, the service should be named after the project.

Published Service Naming Best Practices


Service Names Must Be Unique Across the Environment on Which They Will Be Deployed

Allow for Service Names to Be Used as Command Line Parameters
The B2B Data Transformation utility cm_console provides for quick testing of published services. It takes as its first argument the name of the service to invoke. For ease of use with cm_console, the service name should not include spaces, tabs or newlines, single or double quotes, or characters such as |, ;, %, $, >, \, /

Allow for Service Names to Be Used as Programming Language Identifiers
While B2B Data Transformation currently allows for the service name to be passed in as an arbitrary string when calling the Java and .NET APIs, other agents may expose the service as a function or method in their platform. For maximum compatibility, it is recommended that service names conform to the rules for C identifiers: begin with a letter, followed by combinations of 0-9, A-Z, a-z and _ only. It is also necessary to consider whether the host environment distinguishes between alphabetic case when naming variables. Some application platforms may not distinguish between testService, testservice and TESTSERVICE.

Allow for Service Names to Be Used as Web Service Names
The WSDL specification allows for the use of letters, digits, ., -, _, :, combining characters and extenders in a web service name (or any XML NMTOKEN-valued attribute). B2B Data Transformation does not permit use of : in a project name, so it is recommended that names be kept to a combination of letters, digits, ., -, _ if they are to be used as web services. Conforming to C identifier names will guarantee compatibility.

Keep Service Names Compatible with File and Directory Naming Restrictions on Unix, Windows and other Platforms
As service names determine file and directory names for a variety of solution artifacts, it is highly recommended that service names conform to file name restrictions across a variety of file systems. While it is possible to use invalid Unix file names as service names on Windows, and invalid Windows file names as service names on Unix, it is recommended to avoid OS file system conflicts where possible to maximize future portability. More detailed file system restrictions are identified in the appendix below. Briefly, these include:
- Do not use system file names such as CON, PRN, AUX, CLOCK$, NUL, COM0, COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, LPT0, LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, and LPT9
- Do not use reserved Eclipse names such as .metadata
- Do not use characters such as |\?*<":>+[]/ or control characters
- Optionally, exclude spaces and other whitespace characters from service names
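The case-sensitivity pitfall mentioned above (testService vs. TESTSERVICE) can be checked for mechanically. This sketch, with illustrative names, finds groups of service names that would collide on a platform that ignores alphabetic case.

```python
from collections import defaultdict

# Group service names by their lowercased form; any group with more
# than one entry would collide on a case-insensitive platform.
def case_collisions(names):
    groups = defaultdict(list)
    for name in names:
        groups[name.lower()].append(name)
    return [group for group in groups.values() if len(group) > 1]
```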

Naming Transformation Script Files (TGP Scripts)


TGP scripts have the naming restrictions common to all files on the platform on which they are being deployed.

Naming Transformation Components


Transformation components such as parsers, mappers, serializers and variables must have unique names within a project. There is a single global namespace within any B2B Data Transformation project for all transformation components. One exception exists to this global namespace for component names: sequences of actions within a component such as a mapper or parser may be given a name. In this case, the name is only used for commentary purposes and to assist in matching events to the sequence of script actions that produced the event. For these sub-component names, no restrictions apply, although it is recommended that the names are kept short to ease browsing in the events viewer. The remarks attribute should be used for longer descriptive commentary on the actions taken.

Transformation Component Naming Best Practices


Use Short Descriptive Names
Names of components will show up in event logs, error output and other tracing and logging mechanisms. Keeping names short reduces the screen real estate needed when browsing the event view for debugging.

Incorporate Source and Target Formats in the Name

Optionally Use Prefixes to Annotate Components Used for Internal Use Only
When a component such as a parser or mapper is used for internal purposes only, it may be useful to prefix its name with a letter sequence indicating the type of component.

Type of Component | Prefix | Notes
Variable | v | Do not adorn variables used for external service parameters
Mapper | map | Alternatively use a descriptive name, i.e. MapXToY
Parser | psr | Alternatively use a descriptive name, i.e. ParseMortgageApplication
Transformer | tr | Alternatively use a descriptive name, i.e. RemoveWhitespace
Serializer | ser | Alternatively use a descriptive name, i.e. Serialize837
Preprocessor | pr | Alternatively use a name XToY describing the preprocessing

In addition, names for components should take into account the following suggested rules:
1. Limit names to a reasonably short length. A limit of 40 characters is suggested.
2. Consider using the name of the input and/or output data.
3. Consider limiting names to alphabetic characters, underscores, and numbers.

Variables Exposed as Service Parameters Should Be Unadorned
When a variable is being used to hold a service parameter, no prefix should be used. Use a reasonably short descriptive name instead.

XML Schema Naming


In many B2B Data Transformation solution scenarios, the XML schemas which are the source or target of transformations are defined externally, and control over the naming and style of the schema definition is limited. However, sometimes a transformation project may require one or more intermediate schemas. The following best practices may help with the use of newly created XML schemas in B2B Data Transformation projects.

Use a Target Namespace
Using all no-namespace schemas leads to a proliferation of types within the B2B Data Transformation Studio environment under a single default namespace. Using namespaces on intermediate schemas reduces this logical clutter, in addition to making intermediate schemas more re-usable.

Always Qualify the XML Schema Namespace
Qualify the XML Schema namespace even when using qualified elements and attributes for the domain namespace. It makes schema inclusion and import simpler.

Consider the Use of Explicit Named Complex Types vs. Anonymous Complex Types
The use of anonymous complex types reduces namespace clutter in B2B Data Transformation Studio. However, when multiple copies of schema elements are needed, having the ability to define variables of a complex type simplifies the creation of many transformations. By default, a transformation project allows for the existence of one copy of a schema at a time. Through the use of global complex types, additional variables may be defined to hold secondary copies for interim processing.

Example of an anonymous type:

<xsd:element name="Book">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element name="Title" type="xsd:string"/>
      <xsd:element name="Author" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>

Example of a global type:

<xsd:complexType name="Publication">
  <xsd:sequence>
    <xsd:element name="Title" type="xsd:string"/>
    <xsd:element name="Author" type="xsd:string"/>
  </xsd:sequence>
</xsd:complexType>

<xsd:element name="Book" type="Publication"/>

Through the use of the second form of the definition, we can create a variable of the type Publication.

Appendix: File Name Restrictions On Different Platforms


Reserved Characters and Words
Many operating systems prohibit control characters from appearing in file names. Unix-like systems are an exception, as the only control character forbidden in file names is the null character, since that is the end-of-string indicator in C. Trivially, Unix also excludes the path separator / from appearing in filenames. Some operating systems prohibit particular characters from appearing in file names:

Character | Name | Reason
/ | slash | used as a path name component separator in Unix-like, MS-DOS and Windows
\ | backslash | treated the same as slash in MS-DOS and Windows, and as the escape character in Unix systems (see Note below)
? | question mark | used as a wildcard in Unix and Windows; marks a single character
% | percent sign | used as a wildcard in RT-11; marks a single character
* | asterisk | used as a wildcard in Unix, MS-DOS, RT-11, VMS and Windows; marks any sequence of characters (Unix, Windows, later versions of MS-DOS) or any sequence of characters in either the basename or extension (thus "*.*" in early versions of MS-DOS means "all files")
: | colon | used to determine the mount point / drive on Windows; used to determine the virtual device or physical device such as a drive on RT-11 and VMS; used as a pathname separator in classic Mac OS. Doubled after a name on VMS, it indicates the DECnet nodename (equivalent to a NetBIOS (Windows networking) hostname preceded by "\\")
| | vertical bar | designates software pipelining in Windows
" | quotation mark | used to mark the beginning and end of filenames containing spaces in Windows
< | less than | used to redirect input; allowed in Unix filenames
> | greater than | used to redirect output; allowed in Unix filenames
. | period | allowed, but the last occurrence will be interpreted as the extension separator in VMS, MS-DOS and Windows. In other OSes, usually considered part of the filename, and more than one full stop may be allowed

Note: Some applications on Unix-like systems might allow certain characters but require them to be quoted or escaped; for example, the shell requires spaces, <, >, |, \ and some other characters such as : to be quoted:


five\ and\ six\<seven (example of escaping)
'five and six<seven' or "five and six<seven" (examples of quoting)

In Windows, the space and the period are not allowed as the final character of a filename. The period is allowed as the first character, but certain Windows applications, such as Windows Explorer, forbid creating or renaming such files (despite this convention being used in Unix-like systems to describe hidden files and directories). Among the workarounds are using a different explorer application or saving a file from an application with the desired name.

Some file systems on a given operating system (especially file systems originally implemented on other operating systems), and particular applications on that operating system, may apply further restrictions and interpretations. See the comparison of file systems below for more details on restrictions imposed by particular file systems.

In Unix-like systems, MS-DOS, and Windows, the file names "." and ".." have special meanings (current and parent directory respectively). In addition, in Windows and DOS, some words are reserved and cannot be used as filenames; for example, the DOS device files: CON, PRN, AUX, CLOCK$, NUL, COM0, COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, LPT0, LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, and LPT9.

Operating systems that have these restrictions cause incompatibilities with some other filesystems. For example, Windows will fail to handle, or raise error reports for, these legal UNIX filenames: aux.c, q"uote"s.txt, or NUL.txt.

Comparison of File Name Limitations


MS-DOS FAT: case-insensitive, case-destruction. Allowed character set: A-Z, 0-9, -, _. Reserved characters: all except allowed. Maximum length: 8+3.

Win95 VFAT: case-insensitive. Allowed character set: any. Reserved characters: |\?*<":>+[]/ and control characters. Maximum length: 255.

WinXP NTFS: case sensitivity optional. Allowed character set: any. Reserved characters: |\?*<":>/ and control characters. Reserved words: aux, con, prn. Maximum length: 255.

OS/2 HPFS: case-insensitive, case-preservation. Allowed character set: any. Reserved characters: |\?*<":>/. Maximum length: 254.

Mac OS HFS: case-insensitive, case-preservation. Allowed character set: any. Reserved characters: :. Maximum length: 255. Comment: Finder is limited to 31 characters.

Mac OS HFS+ (Mac OS 8.1 through Mac OS X): case-insensitive, case-preservation. Allowed character set: any. Reserved characters: : on disk, in classic Mac OS, and at the Carbon layer in Mac OS X; / at the Unix layer in Mac OS X. Maximum length: 255.

Most UNIX file systems: case-sensitive, case-preservation. Allowed character set: any except reserved. Reserved characters: / and null. Maximum length: 255. Comment: a leading . means ls and file managers will not show the file by default.

Early UNIX (AT&T): case-sensitive, case-preservation. Allowed character set: any. Reserved characters: / and null. Maximum length: 14. Comment: a leading . indicates a "hidden" file.

POSIX "fully portable filenames": case-sensitive, case-preservation. Allowed character set: A-Z, a-z, 0-9, ., _, -. Reserved characters: / and null. Filenames to avoid include: a.out, core, .profile, .history, .cshrc. Maximum length: 14. Comment: a hyphen must not be the first character.

BeOS BFS: case-sensitive. Allowed character set: UTF-8. Maximum length: 255.

DEC PDP-11 RT-11: case-insensitive. Allowed character set: RADIX-50. Maximum length: 6+3. Comment: flat filesystem with no subdirectories. A full "file specification" includes device, filename and extension (file type) in the format dev:filnam.ext.

DEC VAX VMS: case-insensitive. Allowed character set: A-Z, 0-9, _. Maximum length: 32 per component (earlier 9 per component; latterly, 255 for a filename and 32 for an extension). Comment: a full "file specification" includes nodename, diskname, directory/ies, filename, extension and version in the format OURNODE::MYDISK:[THISDIR.THATDIR]FILENAME.EXTENSION;2. Directories can only go 8 levels deep.

ISO 9660: case-insensitive. Allowed character set: A-Z, 0-9, underscore, period. Comment: 8 directory levels maximum (for Level 1 conformance).
Last updated: 30-May-08 22:03


Naming Conventions - Data Quality

Challenge


As with any other development process, the use of clear, consistent, and documented naming conventions contributes to the effective use of Informatica Data Quality (IDQ). This Best Practice provides suggested naming conventions for the major structural elements of the IDQ Designer and IDQ Plans.

Description
IDQ Designer
The IDQ Designer is the user interface for the development of IDQ plans. Each IDQ plan holds the business rules and operations for a distinct process. IDQ plans may be constructed for use inside the IDQ Designer (a runtime plan), using the athanor-rt command line utility (also runtime), or within an integration with PowerCenter (a real-time plan). IDQ requires that each IDQ plan belong to a project. Optionally, plans may be organized in folders within a project. Folders may be nested to span more than one level. The organizational structure of IDQ is summarized below.

Element | Parent
Repository | None. This is the top-level organization structure.
Project | Repository. There may be multiple projects in a repository.
Folder | Project or Folder. Folders may be nested.
Plan | Project or Folder.

At any common level of visibility, IDQ requires that all elements have distinct names. Thus no two projects within a repository may share the same name. Likewise, no two folders at the same level within a project may share the same name. The rule also applies to plans within the same folder. IDQ will not permit an element to be renamed if the new name would conflict with an existing element at the same level. A dialog will explain the error.


To prevent naming conflicts when an element is copied, it will be prefixed with "Copy of" if it is pasted at the same level as the source of the copy. If the length of the new name exceeds the allowed length for names of that type of element, the name will be truncated.

Naming Projects
When a project is created, by default it will have the name "New Project".

Project naming should be clear and consistent within a repository. The exact approach to naming will vary depending on an organization's needs. Suggested naming rules include:
1. Limit project names to 22 characters if possible. The limit imposed by the repository is 30 characters. Limiting project names to 22 characters allows "Copy of" to be prefixed to copies of a project without truncating characters.
2. Include enough descriptive information within the project name so an unfamiliar user will have a reasonable idea of what plans may be included in the project.
3. If plans within a project will operate on only one data source, including the data source in the project name may be helpful.
4. If abbreviations are used, they should be consistent and documented.
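The 22-character guideline is simple arithmetic: the repository limit is 30 characters and the "Copy of " prefix is 8 characters, so names of 22 characters or fewer survive a copy without truncation. A sketch with illustrative names:

```python
# Illustrative sketch of the "Copy of" length arithmetic: 30-character
# repository limit minus the 8-character "Copy of " prefix leaves 22
# characters for the original name.

PROJECT_NAME_LIMIT = 30

def copied_project_name(name, limit=PROJECT_NAME_LIMIT):
    return ("Copy of " + name)[:limit]
```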

Naming Folders
When a new project is created, by default it will contain four folders, named Consolidation, Matching, Profiling, and Standardization.


This naming convention for folders tracks the major types of IDQ plans. While the default naming convention may prove satisfactory in many cases, it imposes an organizational structure for plans that may not be optimal. Therefore, another naming convention may make more sense in a particular circumstance. Naming guidelines for folders include:
1. Limit folder names to 42 characters if possible. The limit imposed by the repository is 50 characters. Limiting folder names to 42 characters allows "Copy of" to be prefixed to copies of a folder without truncating characters.
2. Include enough descriptive information within the folder name so an unfamiliar user will have a reasonable idea of what plans may be included in the folder.
3. If abbreviations are used, they should be consistent and documented.

Naming Plans
When a new plan is created, the user is required to select one of the four main plan classifications: Analysis, Matching, Standardization, or Consolidation. By default, the new plan name will correspond to the option selected.


Including the plan type as part of the plan name is helpful in describing what the plan does. Other suggested naming rules include:
1. Limit plan names to 42 characters if possible. The limit imposed by the repository is 50 characters. Limiting plan names to 42 characters allows "Copy of" to be prefixed to copies of a plan without truncating characters.
2. Include enough descriptive information within the plan name so an unfamiliar user will have a reasonable idea of what the plan does at a high level.
3. While the project and folder structure will be visible within the IDQ Designer and will be required when using athanor-rt, it is not as readily visible within PowerCenter. Therefore, repetition of the information conveyed by the project and folder names may be advisable.
4. If abbreviations are used, they should be consistent and documented.

Naming Components
Within the Designer, component types may be identified by their unique icons as well as by hovering over a component with a mouse.


However, the component has no visible name at this level. It is only after opening a component for viewing that the component's name becomes visible.

It is suggested that component names be prefixed with an acronym identifying the component type. While less critical than field naming, as discussed below, using a prefix allows for consistent naming and clarity, and it makes field naming more efficient in some cases. Suggested prefixes are listed below.

Component | Prefix
Address Validator | AV_
Bigram | BG_
Character Labeller | CL_


Context Parser | CP_
Edit Distance | ED_
Hamming Distance | HD_
Jaro Distance | JD_
Merge | MG_
Mixed Field Matcher | MFM_
Nysiis | NYS_
Profile Standardizer | PS_
Rule Based Analyzer | RBA_
Scripting | SC_
Search Replace | SR_
Soundex | SX_
Splitter | SPL_
To Upper | TU_
Token Labeller | TL_
Token Parser | TP_
Weight Based Analyzer | WBA_
Word Manager | WM_

In addition, names for components should take into account the following suggested rules:
1. Limit names to a reasonably short length. A limit of 32 characters is suggested. In many cases, component names are also useful for field names, and databases limit field lengths at varying sizes.
2. Consider using the name of the input field, or at least the field type.
3. Consider limiting names to alphabetic characters, spaces, underscores, and numbers. This will make the corresponding field names compatible with most likely output destinations.
4. If the component type abbreviation itself is not sufficient to identify what the component does, include an identifier for the function of the component in its name.
5. If abbreviations are used, they should be consistent and documented.


Naming Dictionaries
Dictionaries may be given any name suitable for the operating system on which they will be used. It is suggested that dictionary naming consider the following rules:
1. Limit dictionary names to characters permitted by the operating system. If a dictionary is to be used on both Windows and UNIX, avoid using spaces.
2. If a dictionary supplied by Informatica is to be modified, it is suggested that the dictionary be renamed and/or moved to a new folder. This will avoid accidentally overwriting the modifications when an update is installed.
3. If abbreviations are used, they should be consistent and documented.

Naming Fields
Careful field naming is probably the most critical standard to follow when using IDQ:

- IDQ requires that all fields output by components have unique names; a name cannot be carried through from component to component.
- The power of IDQ leads to complex plans with many components.
- IDQ does not have the data lineage feature of PowerCenter, so the component name embedded in a field name is the clearest indicator of the source of an input field when a plan is being examined.

With those considerations in mind, the following naming rules are suggested:

1. Prefix each output field name with the type of component.

Component                 Prefix
Address Validator         AV_
Bigram                    BG_
Character Labeller        CL_
Context Parser            CP_
Edit Distance             ED_
Hamming Distance          HD_
Jaro Distance             JD_
Merge                     MG_
Mixed Field Matcher       MFM_
Nysiis                    NYS_
Profile Standardizer      PS_
Rule Based Analyzer       RBA_
Scripting                 SC_
Search Replace            SR_
Soundex                   SX_
Splitter                  SPL_
To Upper                  TU_
Token Labeller            TL_
Token Parser              TP_
Weight Based Analyzer     WBA_
Word Manager              WM_

2. Use meaningful field names, with consistent, documented abbreviations.
3. Use consistent casing.
4. While it is possible to rename output fields in sink components, this practice should be avoided when practical, since there is no convenient way to determine which source field provides data to the renamed output field.

Last updated: 04-Jun-08 18:50


Performing Incremental Loads

Challenge


Data warehousing incorporates very large volumes of data. The process of loading the warehouse in a reasonable timescale without compromising its functionality is extremely difficult. The goal is to create a load strategy that can minimize downtime for the warehouse and allow quick and robust data management.

Description
As time windows shrink and data volumes increase, it is important to understand the impact of a suitable incremental load strategy. The design should allow data to be incrementally added to the data warehouse with minimal impact on the overall system. This Best Practice describes several possible load strategies.

Incremental Aggregation
Incremental aggregation is useful for applying incrementally-captured changes in the source to aggregate calculations in a session. If the source changes only incrementally, and you can capture those changes, you can configure the session to process only those changes with each run. This allows the PowerCenter Integration Service to update the target incrementally, rather than forcing it to process the entire source and recalculate the same calculations each time you run the session.

If the session performs incremental aggregation, the PowerCenter Integration Service saves index and data cache information to disk when the session finishes. The next time the session runs, the PowerCenter Integration Service uses this historical information to perform the incremental aggregation. To utilize this functionality, set the Incremental Aggregation session attribute. For details, see Chapter 24 in the Workflow Administration Guide.

Use incremental aggregation under the following conditions:
- Your mapping includes an aggregate function.
- The source changes only incrementally.
- You can capture incremental changes (i.e., by filtering source data by timestamp).
- You get only delta records (i.e., you may have implemented the CDC (Change Data Capture) feature of PowerExchange).

Do not use incremental aggregation in the following circumstances:

- You cannot capture new source data.
- Processing the incrementally-changed source significantly changes the target. If processing the incrementally-changed source alters more than half the existing target, the session may not benefit from using incremental aggregation.
- Your mapping contains percentile or median functions.
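Conceptually, incremental aggregation persists the aggregate cache between runs and applies only the captured delta rows to it. A minimal sketch of the idea in plain Python (not PowerCenter internals; the dict stands in for the index and data caches that are saved to disk):

```python
def incremental_aggregate(cache, delta_rows):
    # cache: dict of group key -> running SUM, standing in for the index and
    # data cache files the Integration Service saves between session runs
    for key, amount in delta_rows:
        cache[key] = cache.get(key, 0) + amount
    return cache

# Initial run processes the full source once
cache = incremental_aggregate({}, [("east", 100), ("west", 50)])

# Subsequent run applies only the captured delta, not the entire source
cache = incremental_aggregate(cache, [("east", 25)])
```

This is why the technique only pays off when the delta is small relative to the existing target: if most groups change anyway, reprocessing the full source costs about the same.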

Some conditions that may help in making a decision on an incremental strategy include:
- Error handling, loading, and unloading strategies for recovering, reloading, and unloading data.
- History tracking requirements for keeping track of what has been loaded and when.
- Slowly-changing dimensions. The Informatica Mapping Wizards are a good start to an incremental load strategy; the Wizards generate generic mappings as a starting point (refer to Chapter 15 in the Designer Guide).

Source Analysis
Data sources typically fall into the following possible scenarios:
- Delta records. Records supplied by the source system include only new or changed records. In this scenario, all records are generally inserted or updated into the data warehouse.
- Record indicator or flags. Records include columns that specify the intention of the record to be populated into the warehouse. Records can be selected based upon this flag for all inserts, updates, and deletes.
- Date stamped data. Data is organized by timestamps, and loaded into the warehouse based upon the last processing date or the effective date range.
- Key values are present. When only key values are present, data must be checked against what has already been entered into the warehouse. All values must be checked before entering the warehouse.
- No key values present. When no key values are present, surrogate keys are created and all data is inserted into the warehouse based upon the validity of the records.

Identify Records for Comparison


After the sources are identified, you need to determine which records need to be entered into the warehouse and how. Here are some considerations:
- Compare with the target table. When source delta loads are received, determine if the record exists in the target table. The timestamps and natural keys of the record are the starting point for identifying whether the record is new, modified, or should be archived. If the record does not exist in the target, insert it as a new row. If it does exist, determine whether the record needs to be updated, inserted as a new record, removed (deleted from target), or filtered out and not added to the target.
- Record indicators. Record indicators can be beneficial when lookups into the target are not necessary. Take care to ensure that the record exists for update or delete scenarios, or does not exist for successful inserts. Some design effort may be needed to manage errors in these situations.

Determine Method of Comparison


There are four main strategies in mapping design that can be used as a method of comparison:
- Joins of sources to targets. Records are directly joined to the target using Source Qualifier join conditions or using Joiner transformations after the Source Qualifiers (for heterogeneous sources). When using Joiner transformations, take care to ensure the data volumes are manageable and that the smaller of the two datasets is configured as the Master side of the join.
- Lookup on target. Using the Lookup transformation, look up the keys or critical columns in the target relational database. Consider the caching and indexing possibilities.
- Load table log. Generate a log table of records that have already been inserted into the target system. You can use this table for comparison with lookups or joins, depending on the need and volume. For example, store keys in a separate table and compare source records against this log table to determine the load strategy. Another example is to store the dates associated with the data already loaded into a log table.
- MD5 checksum function. Generate a unique value for each row of data and then compare previous and current checksum values to determine whether the record has changed.
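The checksum comparison can be sketched as follows. This is a plain-Python illustration of the idea, not a PowerCenter mapping; the key and column values are invented, and in a mapping the hash would typically come from an expression over the concatenated columns:

```python
import hashlib

def row_checksum(row):
    # Delimit the columns so ("ab", "c") and ("a", "bc") hash differently
    joined = "|".join(str(col) for col in row)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

# Checksum stored from the previous load, keyed by the natural key
previous = {"CUST1": row_checksum(("Smith", "NY"))}

# Incoming row for the same key: compare one checksum instead of every column
incoming = ("Smith", "CA")
changed = row_checksum(incoming) != previous["CUST1"]
```

The benefit is that change detection needs only one stored value per row, regardless of how many columns the comparison would otherwise have to cover.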

Source-Based Load Strategies

Complete Incremental Loads in a Single File/Table


The simplest method for incremental loads is from flat files or a database in which all records are going to be loaded. This strategy requires bulk loads into the warehouse with no overhead on processing of the sources or sorting the source records. Data can be loaded directly from the source locations into the data warehouse. There is no additional overhead produced in moving these sources into the warehouse.

Date-Stamped Data
This method involves data that has been stamped using effective dates or sequences. The incremental load can be determined by dates greater than the previous load date or data that has an effective key greater than the last key processed. With the use of relational sources, the records can be selected based on this effective date and only those records past a certain date are loaded into the warehouse. Views can also be created to perform the selection criteria. This way, the processing does not have to be incorporated into the mappings but is kept on the source component. Placing the load strategy into the other mapping components is more flexible and controllable by the Data Integration developers and the associated metadata. To compare the effective dates, you can use mapping variables to provide the previous date processed (see the description below). An alternative to Repository-maintained mapping variables is the use of control tables to store the dates and update the control table after each load. Non-relational data can be filtered as records are loaded based upon the effective dates or sequenced keys. A Router transformation or filter can be placed after the Source Qualifier to remove old records.
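The control-table alternative mentioned above can be sketched as follows. This is an illustrative sketch only; the ORDERS table, EFFECTIVE_DATE column, and date format are assumptions, not names from any real source system:

```python
# Hypothetical control table contents: last processed date per source
control = {"ORDERS": "2004-09-01 00:00:00"}

def incremental_select(source, control):
    # Build the selection for rows stamped after the previous load date;
    # table and column names here are illustrative only
    return ("SELECT * FROM {0} WHERE EFFECTIVE_DATE > "
            "TO_DATE('{1}', 'YYYY-MM-DD HH24:MI:SS')"
            .format(source, control[source]))

sql = incremental_select("ORDERS", control)

# After a successful load, advance the control table to the max date loaded
control["ORDERS"] = "2004-09-02 00:00:00"
```

The update of the control table after each load plays the same role that the Repository plays for a persisted mapping variable.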

Changed Data Based on Keys or Record Information


Data that is uniquely identified by keys can be sourced according to selection criteria. For example, records that contain primary keys or alternate keys can be used to determine if they have already been entered into the data warehouse. If they exist, you can also check to see if you need to update these records or discard the source record. It may be possible to perform a join with the target tables in which new data can be selected and loaded into the target. It may also be feasible to look up in the target to see if the data exists.
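The key check can be sketched as follows (a plain-Python stand-in for a lookup against the target; the keys are invented for illustration):

```python
# Keys already present in the target table (e.g., from a lookup cache)
target_keys = {"C001", "C002"}

def classify(source_key):
    # New keys become inserts; existing keys need an update-or-discard decision
    return "insert" if source_key not in target_keys else "update_or_discard"

actions = [classify(k) for k in ["C001", "C003"]]
```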

Target-Based Load Strategies


- Loading directly into the target. Loading directly into the target is possible when the data is going to be bulk loaded. The mapping is then responsible for error control, recovery, and update strategy.
- Load into flat files and bulk load using an external loader. The mapping loads data directly into flat files. You can then invoke an external loader to bulk load the data into the target. This method reduces load times (with less downtime for the data warehouse) and provides a means of maintaining a history of the data being loaded into the target. Typically, this method is only used for updates into the warehouse.
- Load into a mirror database. The data is loaded into a mirror database to avoid downtime of the active data warehouse. After the data has been loaded, the databases are switched, making the mirror the active database and the active database the mirror.

Using Mapping Variables


You can use a mapping variable to perform incremental loading. By referencing a date-based mapping variable in the Source Qualifier or join condition, it is possible to select only those rows with a date greater than the previously captured date (i.e., the newly inserted source data). However, the source system must have a reliable date to use. The steps involved in this method are:

Step 1: Create mapping variable


In the Mapping Designer, choose Mappings > Parameters > Variables. Or, to create variables for a mapplet, choose Mapplet > Parameters > Variables in the Mapplet Designer. Click Add and enter the name of the variable (i.e., $$INCREMENT_DATE). In this case, make your variable a date/time. For the Aggregation option, select MAX. In the same screen, state your initial value. This date is used during the initial run of the session and as such should represent a date earlier than the earliest desired data. The date can use any one of these formats:

- MM/DD/RR
- MM/DD/RR HH24:MI:SS
- MM/DD/YYYY
- MM/DD/YYYY HH24:MI:SS

Step 2: Reference the mapping variable in the Source Qualifier


The select statement should look like the following:

Select * from table_A where CREATE_DATE > date('$$INCREMENT_DATE', 'MM-DD-YYYY HH24:MI:SS')

Step 3: Refresh the mapping variable for the next session run using an Expression Transformation
Use an Expression transformation and the pre-defined variable functions to set and use the mapping variable. In the Expression transformation, create a variable port and use the SETMAXVARIABLE variable function to capture the maximum source date selected during each run:

SETMAXVARIABLE($$INCREMENT_DATE, CREATE_DATE)

CREATE_DATE in this example is the date field from the source that should be used to identify incremental rows. You can use the variables in the following transformations:
- Expression
- Filter
- Router
- Update Strategy


As the session runs, the variable is refreshed with the max date value encountered between the source and variable. So, if one row comes through with 9/1/2004, then the variable gets that value. If all subsequent rows are LESS than that, then 9/1/2004 is preserved. Note: This behavior has no effect on the date used in the source qualifier. The initial select always contains the maximum data value encountered during the previous, successful session run. When the mapping completes, the PERSISTENT value of the mapping variable is stored in the repository for the next run of your session. You can view the value of the mapping variable in the session log file. The advantage of the mapping variable and incremental loading is that it allows the session to use only the new rows of data. No table is needed to store the max(date) since the variable takes care of it. After a successful session run, the PowerCenter Integration Service saves the final value of each variable in the repository. So when you run your session the next time, only new data from the source system is captured. If necessary, you can override the value saved in the repository with a value saved in a parameter file.
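The MAX aggregation behavior described above can be sketched in plain Python (dates are written in a sortable YYYY-MM-DD form so that string comparison works; SETMAXVARIABLE itself compares real date values):

```python
def setmaxvariable(current, candidate):
    # Stand-in for SETMAXVARIABLE: the variable only ever moves forward
    return candidate if candidate > current else current

# Value persisted in the repository after the previous successful run
increment_date = "2004-08-15"

# Each row refreshes the variable; smaller dates leave it unchanged
for create_date in ["2004-09-01", "2004-08-20", "2004-08-30"]:
    increment_date = setmaxvariable(increment_date, create_date)
```

After the run, the final value ("2004-09-01" here) is what would be saved to the repository and used by the next session's source selection.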

Using PowerExchange Change Data Capture


PowerExchange (PWX) Change Data Capture (CDC) greatly simplifies the identification, extraction, and loading of change records. It supports all key mainframe and midrange database systems, requires no changes to the user application, uses vendor-supplied technology where possible to capture changes, and eliminates the need for programming or the use of triggers. Once PWX CDC collects changes, it places them in a change stream for delivery to PowerCenter. Included in the change data is useful control information, such as the transaction type (insert/update/delete) and the transaction timestamp. In addition, the change data can be made available immediately (i.e., in real time) or periodically (i.e., where changes are condensed). The native interface between PowerCenter and PowerExchange is PowerExchange Client for PowerCenter (PWXPC). PWXPC enables PowerCenter to pull the change data from the PWX change stream if real-time consumption is needed or from PWX condense files if periodic consumption is required. The changes are applied directly. So if the action flag is I, the record is inserted. If the action flag is U, the record is updated. If the action flag is D, the record is deleted. There is no need for change detection logic in the PowerCenter mapping.


In addition, by leveraging group source processing, where multiple sources are placed in a single mapping, the PowerCenter session reads the committed changes for multiple sources in a single efficient pass, and in the order they occurred. The changes are then propagated to the targets, and upon session completion, restart tokens (markers) are written out to a PowerCenter file so that the next session run knows the point to extract from.

Tips for Using PWX CDC

- After installing PWX, ensure the PWX Listener is up and running and that connectivity is established to the Listener. For best performance, the Listener should be co-located with the source system.
- In the PWX Navigator client tool, use metadata to configure data access. This means creating data maps for the non-relational to relational view of mainframe sources (such as IMS and VSAM) and capture registrations for all sources (mainframe, Oracle, DB2, etc.). Registrations define the specific tables and columns desired for change capture. There should be one registration per source. Group the registrations logically, for example, by source database.
- For an initial test, make changes in the source system to the registered sources. Ensure that the changes are committed. Still working in PWX Navigator (and before using PowerCenter), perform Row Tests to verify the returned change records, including the transaction action flag (the DTL__CAPXACTION column) and the timestamp.
- Set the required access mode: CAPX for change and CAPXRT for real time. Also, if desired, edit the PWX extraction maps to add the Change Indicator (CI) column. This CI flag (Y or N) allows for field-level capture and can be filtered in the PowerCenter mapping.
- Use PowerCenter to materialize the targets (i.e., to ensure that sources and targets are in sync prior to starting the change capture process). This can be accomplished with a simple pass-through batch mapping. This same bulk mapping can be reused for CDC purposes, but only if specific CDC columns are not included, and by changing the session connection/mode.
- Import the PWX extraction maps into Designer. This requires the PWXPC component. Specify the CDC Datamaps option during the import.
- Use group sourcing to create the CDC mapping by including multiple sources in the mapping. This enhances performance because only one read/connection is made to the PWX Listener and all changes (for the sources in the mapping) are pulled at one time.
- Keep the CDC mappings simple. There are some limitations; for instance, you cannot use active transformations. In addition, if loading to a staging area, store the transaction types (i.e., insert/update/delete) and the timestamp for subsequent processing downstream. Also, if loading to a staging area, include an Update Strategy transformation in the mapping with DD_INSERT or DD_UPDATE in order to override the default behavior and store the action flags.
- Set up the Application Connection in Workflow Manager to be used by the CDC session. This requires the PWXPC component. There should be one connection and token file per CDC mapping/session. Set the UOW (unit of work) to a low value for faster commits to the target for real-time sessions. Specify the restart token location and file on the PowerCenter Integration Service (within the infa_shared directory) and specify the location of the PWX Listener.
- In the CDC session properties, enable session recovery (i.e., set the Recovery Strategy to Resume from last checkpoint).
- Use post-session commands to archive the restart token files for restart/recovery purposes. Also, archive the session logs.

Last updated: 01-Feb-07 18:53


Real-Time Integration with PowerCenter

Challenge


Configure PowerCenter to work with various PowerExchange data access products to process real-time data. This Best Practice discusses guidelines for establishing a connection with PowerCenter and setting up a real-time session to work with PowerCenter.

Description
PowerCenter with the real-time option can be used to process data from real-time data sources. PowerCenter supports the following types of real-time data:

- Messages and message queues. PowerCenter with the real-time option can be used to integrate third-party messaging applications using a specific PowerExchange data access product. Each PowerExchange product supports a specific industry-standard messaging application, such as WebSphere MQ, JMS, MSMQ, SAP NetWeaver, TIBCO, and webMethods. You can read from messages and message queues and write to messages, messaging applications, and message queues. WebSphere MQ uses a queue to store and exchange data. Other applications, such as TIBCO and JMS, use a publish/subscribe model; in this case, the message exchange is identified using a topic.
- Web service messages. PowerCenter can receive a web service message from a web service client through the Web Services Hub, transform the data, and load the data to a target or send a message back to a web service client. A web service message is a SOAP request from a web service client or a SOAP response from the Web Services Hub. The Integration Service processes real-time data from a web service client by receiving a message request through the Web Services Hub and processing the request. The Integration Service can send a reply back to the web service client through the Web Services Hub or write the data to a target.
- Changed source data. PowerCenter can extract changed data in real time from a source table using the PowerExchange Listener and write the data to a target. Real-time sources supported by PowerExchange are ADABAS, DATACOM, DB2/390, DB2/400, DB2/UDB, IDMS, IMS, MS SQL Server, Oracle, and VSAM.

Connection Setup
PowerCenter uses some attribute values in order to correctly connect and identify the third-party messaging application and message itself. Each PowerExchange product supplies its own connection attributes that need to be configured properly before running a real-time session.

Setting Up Real-Time Session in PowerCenter


The PowerCenter real-time option uses a zero latency engine to process data from the messaging system. Depending on the messaging systems and the application that sends and receives messages, there may be a period when there are many messages and, conversely, there may be a period when there are no messages. PowerCenter uses the attribute Flush Latency to determine how often the messages are being flushed to the target. PowerCenter also provides various attributes to control when the session ends. The following reader attributes determine when a PowerCenter session should end:
- Message Count - Controls the number of messages the PowerCenter Server reads from the source before the session stops reading from the source.
- Idle Time - Indicates how long the PowerCenter Server waits when no messages arrive before it stops reading from the source.
- Time Slice Mode - Indicates a specific range of time during which the server reads messages from the source. Only PowerExchange for WebSphere MQ uses this option.
- Reader Time Limit - Indicates the number of seconds the PowerCenter Server spends reading messages from the source.

The specific filter conditions and options available to you depend on which real-time source is being used. The following example uses the attributes for PowerExchange for DB2 for i5/OS.

Set the attributes that control how the reader ends. One or more attributes can be used to control the end of the session. For example, set the Reader Time Limit attribute to 3600; the reader will end after 3600 seconds. If the idle time limit is set to 500 seconds, the reader will end if it doesn't process any changes for 500 seconds (i.e., it remains idle for 500 seconds). If more than one attribute is selected, the first attribute that satisfies the condition is used to control the end of the session.

Note: The real-time attributes can be found in the Reader Properties for PowerExchange for JMS, TIBCO, webMethods, and SAP iDoc. For PowerExchange for WebSphere MQ, the real-time attributes must be specified as a filter condition.

The next step is to set the Real-time Flush Latency attribute. The Flush Latency defines how often PowerCenter should flush messages, expressed in milliseconds. For example, if the Real-time Flush Latency is set to 2000, PowerCenter flushes messages every two seconds. The messages will also be flushed from the reader buffer if the Source Based Commit condition is reached. The Source Based Commit condition is defined in the Properties tab of the session.

The message recovery option can be enabled to ensure that no messages are lost if a session fails as a result of an unpredictable error, such as power loss. This is especially important for real-time sessions because some messaging applications do not store messages after they are consumed by another application.

A unit of work (UOW) is a collection of changes within a single commit scope made by a transaction on the source system from an external application. Each UOW may consist of a different number of rows depending on the transaction to the source system. When you use the UOW Count session condition, the Integration Service commits source data to the target when it reaches the number of UOWs specified in the session condition. For example, if the value for UOW Count is 10, the Integration Service commits all data read from the source after the 10th UOW enters the source. The lower you set the value, the faster the Integration Service commits data to the target; a lower value also causes the system to consume more resources.
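The interplay of UOW Count and Flush Latency can be sketched as a reader loop that commits on whichever threshold is reached first. This is a simplified Python stand-in for behavior the Integration Service handles internally:

```python
import time

def read_changes(units_of_work, uow_count=10, flush_latency_secs=2.0):
    # Commit whichever threshold is reached first: the UOW count or the
    # flush latency interval (a simplification of the real engine)
    buffer, commits = [], []
    last_flush = time.monotonic()
    for uow in units_of_work:
        buffer.append(uow)
        elapsed = time.monotonic() - last_flush
        if len(buffer) >= uow_count or elapsed >= flush_latency_secs:
            commits.append(list(buffer))  # flush to the target
            buffer.clear()
            last_flush = time.monotonic()
    if buffer:
        commits.append(list(buffer))      # final flush when the reader ends
    return commits

commits = read_changes(range(25), uow_count=10)
```

A lower uow_count commits sooner but more often, mirroring the latency-versus-resource trade-off noted above.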

Executing a Real-Time Session


A real-time session often has to be up and running continuously to listen to the messaging application and to process messages immediately after the messages arrive. Set the reader attribute Idle Time to -1 and Flush Latency to a specific time interval. This is applicable for all PowerExchange products, including PowerExchange for WebSphere MQ, where the session continues to run and flushes the messages to the target using the specified flush latency interval.

Another scenario is the ability to read data from another source system and immediately send it to a real-time target, for example, reading data from a relational source and writing it to WebSphere MQ. In this case, set the session to run continuously so that every change in the source system can be immediately reflected in the target.

A real-time session may run continuously until a condition is met to end the session. In some situations it may be necessary to periodically stop the session and restart it, for example, to execute a post-session command or run some other process that is not part of the session. To stop the session and restart it, it is useful to deploy continuously running workflows. The Integration Service starts the next run of a continuous workflow as soon as it completes the first. To set a workflow to run continuously, edit the workflow and select the Scheduler tab; edit the Scheduler and select Run Continuously from Run Options. A continuous workflow starts automatically when the Integration Service initializes. When the workflow stops, it restarts immediately.

Real-Time Sessions and Active Transformations


Some of the transformations in PowerCenter are active transformations, which means that the number of input rows and output rows of the transformation are not the same. In most cases, an active transformation requires all of the input rows to be processed before passing output rows to the next transformation or target. For a real-time session, the flush latency is ignored if the DTM needs to wait for all the rows to be processed.

Depending on user needs, active transformations such as Aggregator, Rank, and Sorter can be used in a real-time session by setting the Transaction Scope property in the active transformation to Transaction. This signals the session to process the data in the transformation every transaction. For example, if a real-time session uses an Aggregator that sums a field of an input, the summation is done per transaction, as opposed to across all rows. The result may or may not be correct depending on the requirement. Use an active transformation with a real-time session if you want to process the data per transaction.

Custom transformations can also be defined to handle data per transaction so that they can be used in a real-time session.

PowerExchange Real Time Connections


PowerExchange NRDB CDC Real Time connections can be used to extract changes from ADABAS, DATACOM, IDMS, IMS, and VSAM sources in real time. The DB2/390 connection can be used to extract changes for DB2 on OS/390 and the DB2/400 connection to extract from AS/400. There is a separate connection to read from DB2 UDB in real time.

The NRDB CDC connection requires the application name and the restart token file name to be overridden for every session. When the PowerCenter session completes, the PowerCenter Server writes the last restart token to a physical file called the RestartToken File. The next time the session starts, the PowerCenter Server reads the restart token from the file and starts reading changes from the point where it last left off. Every PowerCenter session needs to have a unique restart token filename. Informatica recommends archiving the file periodically. The reader timeout or the idle timeout can be used to stop a real-time session, and a post-session command can be used to archive the RestartToken file.

The encryption mode for this connection can slow down read performance and increase resource consumption. Compression mode can help in situations where the network is a bottleneck; using compression also increases the CPU and memory usage on the source system.

Archiving PowerExchange Tokens


When the PowerCenter session completes, the Integration Service writes the last restart token to a physical file called the RestartToken File. The token in the file indicates the end point where the read job ended. The next time the session starts, the PowerCenter Server reads the restart token from the file and starts reading changes from the point where it left off. The token file is overwritten each time the session has to write a token out. PowerCenter does not implicitly maintain an archive of these tokens. If, for some reason, the changes from a particular point in time have to be replayed, the PowerExchange token from that point in time is needed. To enable such a process, it is a good practice to periodically copy the token file to a backup folder; this procedure is necessary to maintain an archive of the PowerExchange tokens.

A real-time PowerExchange session may be stopped periodically, using either the reader time limit or the idle time limit. A post-session command is used to copy the restart token file to an archive folder. The session will be part of a continuously running workflow, so when the session completes after the post-session command, it automatically restarts. From a data processing standpoint very little changes; the process pauses for a moment, archives the token, and starts again.

The following are examples of post-session commands that can be used to copy a restart token file (session.token) and append the current system date/time to the file name for archive purposes:

UNIX:
cp session.token session-`date '+%m%d%H%M'`.token

Windows:
copy session.token session-%date:~4,2%-%date:~7,2%-%date:~10,4%-%time:~0,2%-%time:~3,2%.token

PowerExchange for WebSphere MQ


1. In the Workflow Manager, connect to a repository and choose Connection > Queue.
2. The Queue Connection Browser appears. Select New > Message Queue.
3. The Connection Object Definition dialog box appears.

You need to specify three attributes in the Connection Object Definition dialog box:

- Name - the name for the connection. (Use <queue_name>_<QM_name> to uniquely identify the connection.)
- Queue Manager - the Queue Manager name for the message queue. (In Windows, the default Queue Manager name is QM_<machine name>.)
- Queue Name - the Message Queue name.

To obtain the Queue Manager and Message Queue names:

- Open the MQ Series Administration Console. The Queue Manager should appear on the left panel.
- Expand the Queue Manager icon. A list of the queues for the queue manager appears on the left panel.

Note that the Queue Manager's name and Queue Name are case-sensitive.

PowerExchange for JMS


PowerExchange for JMS can be used to read or write messages from various JMS providers, such as WebSphere MQ JMS and BEA WebLogic Server. There are two types of JMS application connections:

- JNDI Application Connection, which is used to connect to a JNDI server during a session run.
- JMS Application Connection, which is used to connect to a JMS provider during a session run.

JNDI Application Connection attributes are:

- Name
- JNDI Context Factory
- JNDI Provider URL
- JNDI UserName
- JNDI Password

JMS Application Connection attributes are:

- Name
- JMS Destination Type
- JMS Connection Factory Name
- JMS Destination
- JMS UserName
- JMS Password

Configuring the JNDI Connection for WebSphere MQ


The JNDI settings for WebSphere MQ JMS can be configured using a file system service or LDAP (Lightweight Directory Access Protocol). The JNDI settings are stored in a file named JMSAdmin.config, which should be installed in the WebSphere MQ Java installation /bin directory.

If you are using a file system service provider to store your JNDI settings, remove the number sign (#) before the following context factory setting:

INITIAL_CONTEXT_FACTORY=com.sun.jndi.fscontext.RefFSContextFactory

Or, if you are using the LDAP service provider to store your JNDI settings, remove the number sign (#) before the following context factory setting:

INITIAL_CONTEXT_FACTORY=com.sun.jndi.ldap.LdapCtxFactory

Find the PROVIDER_URL settings. If you are using a file system service provider to store your JNDI settings, remove the number sign (#) before the following provider URL setting and provide a value for the JNDI directory:

PROVIDER_URL=file:/<JNDI directory>

<JNDI directory> is the directory where you want JNDI to store the .binding file. Or, if you are using the LDAP service provider to store your JNDI settings, remove the number sign (#) before the provider URL setting and specify a hostname:

PROVIDER_URL=ldap://<hostname>/context_name

For example, you can specify:

PROVIDER_URL=ldap://<localhost>/o=infa,c=rc

INFORMATICA CONFIDENTIAL

BEST PRACTICES

408 of 954

If you want to provide a user DN and password for connecting to JNDI, you can remove the # from the following settings and enter a user DN and password:

PROVIDER_USERDN=cn=myname,o=infa,c=rc
PROVIDER_PASSWORD=test

The following table shows the JMSAdmin.config settings and the corresponding attributes in the JNDI application connection in the Workflow Manager:

JMSAdmin.config Setting      JNDI Application Connection Attribute
INITIAL_CONTEXT_FACTORY      JNDI Context Factory
PROVIDER_URL                 JNDI Provider URL
PROVIDER_USERDN              JNDI UserName
PROVIDER_PASSWORD            JNDI Password

Configuring the JMS Connection for WebSphere MQ


The JMS connection is defined using a tool in JMS called jmsadmin, which is available in the WebSphere MQ Java installation /bin directory. Use this tool to configure the JMS Connection Factory. The JMS Connection Factory can be a Queue Connection Factory or a Topic Connection Factory:

- When a Queue Connection Factory is used, define a JMS queue as the destination.
- When a Topic Connection Factory is used, define a JMS topic as the destination.

The command to define a queue connection factory (qcf) is:

def qcf(<qcf_name>) qmgr(queue_manager_name) hostname(QM_machine_hostname) port(QM_machine_port)

The command to define a JMS queue is:

def q(<JMS_queue_name>) qmgr(queue_manager_name) qu(queue_manager_queue_name)

The command to define a JMS topic connection factory (tcf) is:

def tcf(<tcf_name>) qmgr(queue_manager_name) hostname(QM_machine_hostname) port(QM_machine_port)

The command to define the JMS topic is:

def t(<JMS_topic_name>) topic(pub/sub_topic_name)

The topic name must be unique. For example: topic(application/infa)

The following table shows the JMS object types and the corresponding attributes in the JMS application connection in the Workflow Manager:

JMS Object Type                                    JMS Application Connection Attribute
QueueConnectionFactory or TopicConnectionFactory   JMS Connection Factory Name
JMS Queue Name or JMS Topic Name                   JMS Destination

Configure the JNDI and JMS Connection for WebSphere


Configure the JNDI settings for WebSphere to use WebSphere as a provider for JMS sources or targets in a PowerCenterRT session.

JNDI Connection

Add the following option to the file JMSAdmin.bat to configure JMS properly:

-Djava.ext.dirs=<WebSphere Application Server>\bin

For example: -Djava.ext.dirs=WebSphere\AppServer\bin

The JNDI connection resides in the JMSAdmin.config file, which is located in the MQ Series Java/bin directory:

INITIAL_CONTEXT_FACTORY=com.ibm.websphere.naming.wsInitialContextFactory
PROVIDER_URL=iiop://<hostname>/

For example: PROVIDER_URL=iiop://localhost/

PROVIDER_USERDN=cn=informatica,o=infa,c=rc
PROVIDER_PASSWORD=test

JMS Connection


The JMS configuration is similar to the JMS Connection for WebSphere MQ.

Configure the JNDI and JMS Connection for BEA WebLogic


Configure the JNDI settings for BEA WebLogic to use BEA WebLogic as a provider for JMS sources or targets in a PowerCenterRT session. PowerCenter Connect for JMS and the WebLogic server hosting JMS do not need to be on the same machine; PowerCenter Connect for JMS just needs a URL, as long as the URL points to the right place.

JNDI Connection

The WebLogic Server automatically provides a context factory and URL during the JNDI set-up configuration for WebLogic Server. Enter these values to configure the JNDI connection for JMS sources and targets in the Workflow Manager.

Enter the following value for JNDI Context Factory in the JNDI application connection in the Workflow Manager:

weblogic.jndi.WLInitialContextFactory

Enter the following value for JNDI Provider URL in the JNDI application connection in the Workflow Manager:

t3://<WebLogic_Server_hostname>:<port>

where WebLogic_Server_hostname is the hostname or IP address of the WebLogic Server and port is the port number for the WebLogic Server.

JMS Connection

The JMS connection is configured from the BEA WebLogic Server console. Select JMS > Connection Factory. The JMS Destination is also configured from the BEA WebLogic Server console: from the Console pane, select Services > JMS > Servers > <JMS Server name> > Destinations under your domain, then click Configure a New JMSQueue or Configure a New JMSTopic.

The following table shows the JMS object types and the corresponding attributes in the JMS application connection in the Workflow Manager:

WebLogic Server JMS Object              JMS Application Connection Attribute
Connection Factory Settings: JNDIName   JMS Connection Factory Name
Destination Settings: JNDIName          JMS Destination

In addition to the JNDI and JMS settings, BEA WebLogic also offers a function called JMS Store, which can be used for persistent messaging when reading and writing JMS messages. The JMS Stores configuration is available from the Console pane: select Services > JMS > Stores under your domain.

Configuring the JNDI and JMS Connection for TIBCO


TIBCO Rendezvous Server does not adhere to JMS specifications. As a result, PowerCenter Connect for JMS cannot connect directly to the Rendezvous Server. TIBCO Enterprise Server, which is JMS-compliant, acts as a bridge between PowerCenter Connect for JMS and TIBCO Rendezvous Server. Configure a connection bridge between TIBCO Rendezvous Server and TIBCO Enterprise Server for PowerCenter Connect for JMS to be able to read messages from, and write messages to, TIBCO Rendezvous Server.

To create a connection bridge between PowerCenter Connect for JMS and TIBCO Rendezvous Server, follow these steps:

1. Configure PowerCenter Connect for JMS to communicate with TIBCO Enterprise Server.
2. Configure TIBCO Enterprise Server to communicate with TIBCO Rendezvous Server.

Configure the following information in your JNDI application connection:
- JNDI Context Factory: com.tibco.tibjms.naming.TibjmsInitialContextFactory
- Provider URL: tibjmsnaming://<host>:<port>, where host and port are the host name and port number of the Enterprise Server.

To make a connection bridge between TIBCO Rendezvous Server and TIBCO Enterprise Server:

1. In the file tibjmsd.conf, enable the tibrv transport configuration parameter, as in the example below, so that TIBCO Enterprise Server can communicate with TIBCO Rendezvous messaging systems:

   tibrv_transports = enabled

2. Enter the following transports in the transports.conf file:

   [RV]
   type = tibrv // type of external messaging system
   topic_import_dm = TIBJMS_RELIABLE // only reliable/certified messages can transfer
   daemon = tcp:localhost:7500 // default daemon for the Rendezvous server

   The transports in the transports.conf configuration file specify the communication protocol between TIBCO Enterprise for JMS and the TIBCO Rendezvous system. The import and export properties on a destination can list one or more transports to use to communicate with the TIBCO Rendezvous system.


3. Optionally, specify the name of one or more transports for reliable and certified message delivery in the export property in the file topics.conf, as in the following example:

   topicname export="RV"

   The export property allows messages published to a topic by a JMS client to be exported to the external systems with configured transports. Currently, you can configure transports for TIBCO Rendezvous reliable and certified messaging protocols.

PowerExchange for webMethods


When importing webMethods sources into the Designer, be sure the webMethods host file doesn't contain the '.' character. You cannot use fully-qualified names for the connection when importing webMethods sources. You can use fully-qualified names for the connection when importing webMethods targets, because PowerCenter doesn't use the same grouping method for importing sources and targets. To get around this, modify the host file to resolve the name to the IP address. For example:

Host File: crpc23232.crp.informatica.com crpc23232

Use crpc23232 instead of crpc23232.crp.informatica.com as the host name when importing a webMethods source definition. This step is only required for importing PowerExchange for webMethods sources into the Designer.

If you are using the request/reply model in webMethods, PowerCenter needs to send an appropriate document back to the broker for every document it receives. PowerCenter populates some of the envelope fields of the webMethods target to enable the webMethods broker to recognize that the published document is a reply from PowerCenter. The envelope fields destid and tag are populated for the request/reply model: destid should be populated from the pubid of the source document, and tag should be populated from the tag of the source document. Use the option Create Default Envelope Fields when importing webMethods sources and targets into the Designer in order to make the envelope fields available in PowerCenter.

Configuring the PowerExchange for webMethods Connection


To create or edit the PowerExchange for webMethods connection, select Connections > Application > webMethods Broker from the Workflow Manager. PowerExchange for webMethods connection attributes are:

- Name
- Broker Host
- Broker Name
- Client ID
- Client Group
- Application Name
- Automatic Reconnect
- Preserve Client State

Enter the connection to the Broker Host in the format <hostname:port>. If you are using the request/reply method in webMethods, you have to specify a client ID in the connection. Be sure that the client ID used in the request connection is the same as the client ID used in the reply connection. Note that if you are using multiple request/reply document pairs, you need to set up different webMethods connections for each pair, because they cannot share a client ID.

Last updated: 27-May-08 16:27


Session and Data Partitioning

Challenge


Improving performance by identifying strategies for partitioning relational tables, XML, COBOL and standard flat files, and by coordinating the interaction between sessions, partitions, and CPUs. These strategies take advantage of the enhanced partitioning capabilities in PowerCenter.

Description
On hardware systems that are under-utilized, you may be able to improve performance by processing partitioned data sets in parallel in multiple threads of the same session instance running on the PowerCenter Server engine. However, parallel execution may impair performance on over-utilized systems or systems with smaller I/O capacity.

In addition to hardware, consider these other factors when determining if a session is an ideal candidate for partitioning: source and target database setup, target type, mapping design, and certain assumptions that are explained in the following paragraphs. Use the Workflow Manager client tool to implement session partitioning.

Assumptions
The following assumptions pertain to the source and target systems of a session that is a candidate for partitioning. These factors can help to maximize the benefits that can be achieved through partitioning.
- Indexing has been implemented on the partition key when using a relational source.
- Source files are located on the same physical machine as the PowerCenter Server process when partitioning flat files, COBOL, and XML, to reduce network overhead and delay.
- All possible constraints are dropped or disabled on relational targets.
- All possible indexes are dropped or disabled on relational targets.
- Table spaces and database partitions are properly managed on the target system.
- Target files are written to the same physical machine that hosts the PowerCenter process, in order to reduce network overhead and delay.
- Oracle External Loaders are utilized whenever possible.

First, determine if you should partition your session. Parallel execution benefits systems that have the characteristics described below.

Check idle time and busy percentage for each thread; this gives high-level information about the bottleneck point(s). To do this, open the session log and look for messages starting with PETL_ under the RUN INFO FOR TGT LOAD ORDER GROUP section. These PETL messages give the following details for the reader, transformation, and writer threads:

- Total Run Time
- Total Idle Time
- Busy Percentage
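The three PETL figures are related by a simple formula: busy percentage is the share of total run time that the thread was not idle. A quick sketch for interpreting the numbers (illustrative Python, not PowerCenter code):

```python
def busy_percentage(total_run_time, total_idle_time):
    """Share of a thread's run time spent doing work rather than
    waiting. A low value points away from this thread as the
    bottleneck."""
    return 100.0 * (total_run_time - total_idle_time) / total_run_time

# A reader thread that ran 120s but idled 90s was only 25% busy,
# suggesting the bottleneck lies in a downstream thread.
reader_busy = busy_percentage(120, 90)  # 25.0
```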

Under-utilized or intermittently-used CPUs. To determine if this is the case, check the CPU usage of your machine:

- Windows 2000/2003 - check the Task Manager Performance tab.
- UNIX - type vmstat 1 10 on the command line. The id column displays the percentage of CPU idle time during the specified interval, without any I/O wait.

If there are CPU cycles available (i.e., twenty percent or more idle time), then this session's performance may be improved by adding a partition.

Sufficient I/O. To determine the I/O statistics:

- Windows 2000/2003 - check the Task Manager Performance tab.
- UNIX - type iostat on the command line. The %iowait column displays the percentage of CPU time spent idling while waiting for I/O requests; the %idle column displays the total percentage of the time that the CPU spends idling (i.e., the unused capacity of the CPU).

Sufficient memory. If too much memory is allocated to your session, you will receive a memory allocation error. Check to see that you're using as much memory as you can. If the session is paging, increase the memory. To determine if the session is paging:

- Windows 2000/2003 - check the Task Manager Performance tab.
- UNIX - type vmstat 1 10 on the command line. The pi column displays the number of pages swapped in from the page space during the specified interval, and po displays the number of pages swapped out. If these values indicate that paging is occurring, it may be necessary to allocate more memory, if possible.

If you determine that partitioning is practical, you can begin setting up the partition.

Partition Types
PowerCenter provides increased control of the pipeline threads. Session performance can be improved by adding partitions at various pipeline partition points. When you configure the partitioning information for a pipeline, you must specify a partition type. The partition type determines how the PowerCenter Server redistributes data across partition points. The Workflow Manager allows you to specify the following partition types:

Round-robin Partitioning
The PowerCenter Server distributes data evenly among all partitions. Use round-robin partitioning when you need to distribute rows evenly and do not need to group data among partitions. In a pipeline that reads data from file sources of different sizes, use round-robin partitioning. For example, consider a session based on a mapping that reads data from three flat files of different sizes.
- Source file 1: 100,000 rows
- Source file 2: 5,000 rows
- Source file 3: 20,000 rows

In this scenario, the recommended best practice is to set a partition point after the Source Qualifier and set the partition type to round-robin. The PowerCenter Server distributes the data so that each partition processes approximately one third of the data.
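The even distribution described above can be pictured with a small conceptual sketch (plain Python, not PowerCenter internals):

```python
from itertools import cycle

def round_robin(rows, n_partitions):
    """Deal rows one at a time across partitions so each ends up
    with roughly the same number, regardless of source file size."""
    partitions = [[] for _ in range(n_partitions)]
    dealer = cycle(partitions)
    for row in rows:
        next(dealer).append(row)
    return partitions

# 100,000 + 5,000 + 20,000 = 125,000 rows across three partitions:
parts = round_robin(range(125_000), 3)
sizes = [len(p) for p in parts]  # each partition holds about a third
```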

Hash Partitioning
The PowerCenter Server applies a hash function to a partition key to group data among partitions. Use hash partitioning when you want to ensure that the PowerCenter Server processes groups of rows with the same partition key in the same partition. For example, consider a scenario where you need to sort items by item ID, but do not know how many items have a particular ID number.

If you select hash auto-keys, the PowerCenter Server uses all grouped or sorted ports as the partition key. If you select hash user keys, you specify a number of ports to form the partition key. An example of this type of partitioning is when you are using Aggregators and need to ensure that groups of data based on a primary key are processed in the same partition.
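The grouping guarantee can be sketched conceptually (illustrative Python, not the engine's actual hash function): every row with the same key value is routed to the same partition, which is exactly what an Aggregator needs to compute per-group results correctly.

```python
def hash_partition(rows, key, n_partitions):
    """Route each row by hashing its partition-key value, so all
    rows sharing a key land in the same partition."""
    partitions = [[] for _ in range(n_partitions)]
    for row in rows:
        partitions[hash(row[key]) % n_partitions].append(row)
    return partitions

orders = [{"item_id": i % 5, "qty": 1} for i in range(100)]
parts = hash_partition(orders, "item_id", 3)
# No item_id value is ever split across two partitions.
```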

Key Range Partitioning


With this type of partitioning, you specify one or more ports to form a compound partition key for a source or target. The PowerCenter Server then passes data to each partition depending on the ranges you specify for each port. Use key range partitioning where the sources or targets in the pipeline are partitioned by key range. Refer to the Workflow Administration Guide for further directions on setting up key range partitions.

For example, with key range partitioning set at End range = 2020, the PowerCenter Server passes in data where values are less than 2020. Similarly, for Start range = 2020, the PowerCenter Server passes in data where values are equal to or greater than 2020. Null values, or values that do not fall in either partition, are passed through the first partition.
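The boundary behavior in the example, including where NULLs go, can be sketched as follows (illustrative Python; the real configuration lives in the session properties):

```python
def key_range_partition(rows, key, boundary):
    """Two-way key-range split: values below the boundary go to the
    first partition, values at or above it to the second; NULLs
    (None) fall through to the first partition."""
    first, second = [], []
    for row in rows:
        value = row[key]
        if value is None or value < boundary:
            first.append(row)
        else:
            second.append(row)
    return first, second

rows = [{"year": y} for y in (1999, 2019, 2020, 2021, None)]
first, second = key_range_partition(rows, "year", 2020)
# first  -> 1999, 2019, None (NULL routed to the first partition)
# second -> 2020, 2021
```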

Pass-through Partitioning
In this type of partitioning, the PowerCenter Server passes all rows at one partition point to the next partition point without redistributing them. Use pass-through partitioning where you want to create an additional pipeline stage to improve performance, but do not want to (or cannot) change the distribution of data across partitions.

The Data Transformation Manager spawns a master thread on each session run, which in turn creates three threads (reader, transformation, and writer threads) by default. Each of these threads can, at the most, process one data set at a time and hence, three data sets simultaneously. If there are complex transformations in the mapping, the transformation thread may take a longer time than the other threads, which can slow data throughput.


It is advisable to define partition points at these transformations. This creates another pipeline stage and reduces the overhead of a single transformation thread. When you have considered all of these factors and selected a partitioning strategy, you can begin the iterative process of adding partitions. Continue adding partitions to the session until you meet the desired performance threshold or observe degradation in performance.

Tips for Efficient Session and Data Partitioning


- Add one partition at a time. To best monitor performance, add one partition at a time, and note your session settings before adding additional partitions. Refer to the Workflow Administrator Guide for more information on restrictions on the number of partitions.
- Set DTM buffer memory. For a session with n partitions, set this value to at least n times the original value for the non-partitioned session.
- Set cached values for the Sequence Generator. For a session with n partitions, there is generally no need to use the Number of Cached Values property of the Sequence Generator. If you must set this value to a value greater than zero, make sure it is at least n times the original value for the non-partitioned session.
- Partition the source data evenly. The source data should be partitioned into equal-sized chunks for each partition.
- Partition tables. A notable increase in performance can also be realized when the actual source and target tables are partitioned. Work with the DBA to discuss the partitioning of source and target tables, and the setup of tablespaces.
- Consider using an external loader. As with any session, using an external loader may increase session performance. You can only use Oracle external loaders for partitioning. Refer to the Session and Server Guide for more information on using and setting up the Oracle external loader for partitioning.
- Write throughput. Check the session statistics to see if you have increased the write throughput.
- Paging. Check to see if the session is now causing the system to page. When you partition a session and there are cached lookups, you must make sure that DTM memory is increased to handle the lookup caches. When you partition a source that uses a static lookup cache, the PowerCenter Server creates one memory cache for each partition and one disk cache for each transformation. Thus, memory requirements grow for each partition. If the memory is not increased, the system may start paging to disk, causing degradation in performance.

When you finish partitioning, monitor the session to see if the partition is degrading or improving session performance. If the session performance is improved and the session meets your requirements, add another partition.

Session on Grid and Partitioning Across Nodes


Session on Grid provides the ability to run a session on a multi-node Integration Service. This is most suitable for large sessions; for small and medium-sized sessions, it is more practical to distribute whole sessions to different nodes using Workflow on Grid. Session on Grid leverages existing partitions of a session by executing threads in multiple DTMs. The Log service can be used to get the cumulative log. See PowerCenter Enterprise Grid Option for detailed configuration information.

Dynamic Partitioning
Dynamic partitioning is also called parameterized partitioning because a single parameter can determine the number of partitions. With the Session on Grid option, more partitions can be added when more resources are available. The number of partitions in a session can also be tied to partitions in the database, so that PowerCenter partitioning leverages database partitioning with less maintenance.

Last updated: 06-Dec-07 15:04


Using Parameters, Variables and Parameter Files

Challenge


Understanding how parameters, variables, and parameter files work and using them for maximum efficiency.

Description
Prior to the release of PowerCenter 5, the only variables inherent to the product were those defined for specific transformations and those server variables that were global in nature. Transformation variables were defined as variable ports in a transformation and could only be used in that specific transformation object (e.g., Expression, Aggregator, and Rank transformations). Similarly, global parameters defined within Server Manager would affect the subdirectories for source files, target files, log files, and so forth.

More current versions of PowerCenter made variables and parameters available across the entire mapping rather than for a specific transformation object. In addition, they provide built-in parameters for use within Workflow Manager. Using parameter files, these values can change from session run to session run. With the addition of workflows, parameters can now be passed to every session contained in the workflow, providing more flexibility and reducing parameter file maintenance. Other important functionality added in recent releases is the ability to dynamically create parameter files that can be used in the next session in a workflow or in other workflows.

Parameters and Variables


Use a parameter file to define the values for parameters and variables used in a workflow, worklet, mapping, or session. A parameter file can be created using a text editor such as WordPad or Notepad. List the parameters or variables and their values in the parameter file. Parameter files can contain the following types of parameters and variables:
- Workflow variables
- Worklet variables
- Session parameters
- Mapping parameters and variables

When using parameters or variables in a workflow, worklet, mapping, or session, the Integration Service checks the parameter file to determine the start value of the parameter or variable. Use a parameter file to initialize workflow variables, worklet variables, mapping parameters, and mapping variables. If start values are not defined for these parameters and variables, the Integration Service checks for the start value of the parameter or variable in other places.

Session parameters must be defined in a parameter file. Because session parameters do not have default values, if the Integration Service cannot locate the value of a session parameter in the parameter file, it fails to initialize the session.

To include parameter or variable information for more than one workflow, worklet, or session in a single parameter file, create separate sections for each object within the parameter file. You can also create multiple parameter files for a single workflow, worklet, or session and change the file that these tasks use, as necessary. To specify the parameter file that the Integration Service uses with a workflow, worklet, or session, do either of the following:


- Enter the parameter file name and directory in the workflow, worklet, or session properties.
- Start the workflow, worklet, or session using pmcmd and enter the parameter file name and directory in the command line.

If entering a parameter file name and directory in the workflow, worklet, or session properties and in the pmcmd command line, the Integration Service uses the information entered in the pmcmd command line.

Parameter File Format


When entering values in a parameter file, precede the entries with a heading that identifies the workflow, worklet or session whose parameters and variables are to be assigned. Assign individual parameters and variables directly below this heading, entering each parameter or variable on a new line. List parameters and variables in any order for each task. The following heading formats can be defined:
- Workflow variables - [folder name.WF:workflow name]
- Worklet variables - [folder name.WF:workflow name.WT:worklet name]
- Worklet variables in nested worklets - [folder name.WF:workflow name.WT:worklet name.WT:worklet name...]
- Session parameters, plus mapping parameters and variables - [folder name.WF:workflow name.ST:session name] or [folder name.session name] or [session name]

Below each heading, define parameter and variable values as follows:


- parameter name=value
- parameter2 name=value
- variable name=value
- variable2 name=value

For example, a session in the production folder, s_MonthlyCalculations, uses a string mapping parameter, $$State, that needs to be set to MA, and a datetime mapping variable, $$Time. $$Time already has an initial value of 9/30/2000 00:00:00 saved in the repository, but this value needs to be overridden to 10/1/2000 00:00:00. The session also uses session parameters to connect to source files and target databases, as well as to write the session log to the appropriate session log file. The following table shows the parameters and variables that can be defined in the parameter file:

Parameter and Variable Type              Parameter and Variable Name   Desired Definition
String Mapping Parameter                 $$State                       MA
Datetime Mapping Variable                $$Time                        10/1/2000 00:00:00
Source File (Session Parameter)          $InputFile1                   Sales.txt
Database Connection (Session Parameter)  $DBConnection_Target          Sales (database connection)
Session Log File (Session Parameter)     $PMSessionLogFile             d:/session logs/firstrun.txt

The parameter file for the session includes the folder and session name, as well as each parameter and variable:
[Production.s_MonthlyCalculations]
$$State=MA
$$Time=10/1/2000 00:00:00
$InputFile1=sales.txt
$DBConnection_target=sales
$PMSessionLogFile=D:/session logs/firstrun.txt

The next time the session runs, edit the parameter file to change the state to MD and delete the $$Time variable. This allows the Integration Service to use the value for the variable that was set in the previous session run.
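The file format above is simple enough to validate with a few lines of code. The following Python sketch (not an Informatica utility) parses the [heading] / name=value layout into per-section dictionaries, which can be handy for auditing generated parameter files:

```python
def parse_parameter_file(text):
    """Split a parameter file into sections keyed by heading, each
    mapping parameter/variable names to their values."""
    sections = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            current = sections.setdefault(line[1:-1], {})
        elif "=" in line and current is not None:
            # split on the first '=' only; values may contain spaces
            name, _, value = line.partition("=")
            current[name.strip()] = value.strip()
    return sections

sample = """[Production.s_MonthlyCalculations]
$$State=MD
$InputFile1=sales.txt
"""
params = parse_parameter_file(sample)
# params["Production.s_MonthlyCalculations"]["$$State"] == "MD"
```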

Mapping Variables
Declare mapping variables in PowerCenter Designer using the menu option Mappings -> Parameters and Variables (See the first figure, below). After selecting mapping variables, use the pop-up window to create a variable by specifying its name, data type, initial value, aggregation type, precision, and scale. This is similar to creating a port in most transformations (See the second figure, below).


Variables, by definition, are objects that can change value dynamically. PowerCenter has four functions to affect change to mapping variables:
- SetVariable
- SetMaxVariable
- SetMinVariable
- SetCountVariable

A mapping variable can store the last value from a session run in the repository to be used as the starting value for the next session run.
- Name. The name of the variable should be descriptive and be preceded by $$ (so that it is easily identifiable as a variable). A typical variable name is: $$Procedure_Start_Date.
- Aggregation type. This entry creates specific functionality for the variable and determines how it stores data. For example, with an aggregation type of Max, the value stored in the repository at the end of each session run would be the maximum value across ALL records until the value is deleted.
- Initial value. This value is used during the first session run when there is no corresponding and overriding parameter file. This value is also used if the stored repository value is deleted. If no initial value is identified, then a data-type specific default value is used.
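The effect of the aggregation type on the persisted value can be pictured conceptually (plain Python, not Informatica's implementation): each Set*Variable call folds a row-level value into the running value, and the result is what gets saved at the end of the run.

```python
def end_of_run_value(start_value, row_values, aggregation):
    """Fold row-level variable updates into the single value that
    would be persisted to the repository at the end of the run."""
    value = start_value
    for v in row_values:
        if aggregation == "Max":      # SetMaxVariable
            value = max(value, v)
        elif aggregation == "Min":    # SetMinVariable
            value = min(value, v)
        elif aggregation == "Count":  # SetCountVariable
            value += 1
        else:                         # SetVariable: last value wins
            value = v
    return value

# With Max aggregation, the repository keeps the largest value
# seen across ALL rows of the run.
```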

Variable values are not stored in the repository when the session:
- Fails to complete.
- Is configured for a test load.
- Is a debug session.



- Runs in debug mode and is configured to discard session output.

Order of Evaluation
The start value is the value of the variable at the start of the session. The start value can be a value defined in the parameter file for the variable, a value saved in the repository from the previous run of the session, a user-defined initial value for the variable, or the default value based on the variable data type. The Integration Service looks for the start value in the following order:
1. Value in session parameter file
2. Value saved in the repository
3. Initial value
4. Default value
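The precedence above can be sketched as a small resolver. Everything here (function name, dictionaries) is illustrative, not a PowerCenter API:

```python
def start_value(name, param_file, repository, initial_values,
                default="<datatype default>"):
    """Resolve a variable's start value in the documented order."""
    if name in param_file:       # 1. value in session parameter file
        return param_file[name]
    if name in repository:       # 2. value saved in the repository
        return repository[name]
    if name in initial_values:   # 3. user-defined initial value
        return initial_values[name]
    return default               # 4. data-type specific default

# The parameter file wins even when a repository value exists:
print(start_value("$$Post_Date",
                  {"$$Post_Date": "04/21/2001"},
                  {"$$Post_Date": "02/03/1998"},
                  {"$$Post_Date": "01/01/1900"}))
```

Commenting a variable out of the parameter file (as shown later with the semicolon) simply removes step 1, so the repository value takes over.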

Mapping Parameters and Variables


Since parameter values do not change over the course of the session run, the value used is based on:
1. Value in session parameter file
2. Initial value
3. Default value

Once defined, mapping parameters and variables can be used in the Expression Editor section of the following transformations:
- Expression
- Filter
- Router
- Update Strategy
- Aggregator

Mapping parameters and variables also can be used within the Source Qualifier in the SQL query, user-defined join, and source filter sections, as well as in a SQL override in the lookup transformation.

Guidelines for Creating Parameter Files


Use the following guidelines when creating parameter files:
- Enter folder names for non-unique session names. When a session name exists more than once in a repository, enter the folder name to indicate the location of the session.
- Create one or more parameter files. Assign parameter files to workflows, worklets, and sessions individually. Specify the same parameter file for all of these tasks or create several parameter files.
- If including parameter and variable information for more than one session in the file, create a new section for each session. The folder name is optional.

[folder_name.session_name]
parameter_name=value


variable_name=value
mapplet_name.parameter_name=value

[folder2_name.session_name]
parameter_name=value
variable_name=value
mapplet_name.parameter_name=value


- Specify headings in any order. Place headings in any order in the parameter file. However, if defining the same parameter or variable more than once in the file, the Integration Service assigns the parameter or variable value using the first instance of the parameter or variable.
- Specify parameters and variables in any order. Below each heading, the parameters and variables can be specified in any order.
- When defining parameter values, do not use unnecessary line breaks or spaces. The Integration Service may interpret additional spaces as part of the value.
- List all necessary mapping parameters and variables. Values entered for mapping parameters and variables become the start value for parameters and variables in a mapping. Mapping parameter and variable names are not case sensitive.
- List all session parameters. Session parameters do not have default values. An undefined session parameter can cause the session to fail. Session parameter names are not case sensitive.
- Use correct date formats for datetime values. When entering datetime values, use the following date formats:
  MM/DD/RR
  MM/DD/RR HH24:MI:SS
  MM/DD/YYYY
  MM/DD/YYYY HH24:MI:SS

- Do not enclose parameters or variables in quotes in the parameter file. The Integration Service interprets everything after the equal sign as part of the value.
- Do enclose parameters in single quotes in a Source Qualifier SQL override if the parameter represents a string or date/time value to be used in the SQL override.
- Precede parameters and variables created in mapplets with the mapplet name, as follows:
  mapplet_name.parameter_name=value
  mapplet2_name.variable_name=value

Sample: Parameter Files and Session Parameters


Parameter files, along with session parameters, allow you to change certain values between sessions. A commonly-used feature is the ability to create user-defined database connection session parameters to reuse sessions for different relational sources or targets. Use session parameters in the session properties, and then define the parameters in a parameter file. To do this, name all database connection session parameters with the prefix $DBConnection, followed by any alphanumeric and underscore characters. Session parameters and parameter files help reduce the overhead of creating multiple mappings when only certain attributes of a mapping need to be changed.

Using Parameters in Source Qualifiers


Another commonly used feature is the ability to create parameters in source qualifiers, which allows you to reuse the same mapping with different sessions to extract the data specified in the parameter file each session references. It is also possible to chain two mappings: the first mapping creates a flat file that serves as a parameter file, and the second mapping reads a parameter in its Source Qualifier transformation from the parameter file that the first mapping produced.

Sample: Variables and Parameters in an Incremental Strategy


Variables and parameters can enhance incremental strategies. The following example uses a mapping variable, an expression transformation object, and a parameter file for restarting.

Scenario
Company X wants to start with an initial load of all data, but wants subsequent process runs to select only new information. The source data has an inherent post date, stored in a column named Date_Entered, that can be used. The process runs once every twenty-four hours.

Sample Solution
Create a mapping with source and target objects. From the menu, create a new mapping variable named $$Post_Date with the following attributes:
- TYPE: Variable
- DATATYPE: Date/Time
- AGGREGATION TYPE: MAX
- INITIAL VALUE: 01/01/1900

Note that there is no need to encapsulate the INITIAL VALUE in quotation marks. However, if this value is used within the Source Qualifier SQL, it may be necessary to use native RDBMS functions to convert it (e.g., TO_DATE(--,--)). Within the Source Qualifier transformation, use the following in the Source Filter attribute (this sample assumes Oracle as the source RDBMS):

DATE_ENTERED > to_Date('$$Post_Date','MM/DD/YYYY HH24:MI:SS')

Also note that the initial value 01/01/1900 will be expanded by the Integration Service to 01/01/1900 00:00:00, hence the need to convert the parameter to a datetime.

The next step is to forward $$Post_Date and Date_Entered to an Expression transformation, where the function for setting the variable resides. Create an output port named Post_Date with a data type of date/time, and place the following function in the expression code section:

SETMAXVARIABLE($$Post_Date,DATE_ENTERED)


The function evaluates each value for DATE_ENTERED and updates the variable with the Max value to be passed forward. For example:

DATE_ENTERED   Resultant POST_DATE
9/1/2000       9/1/2000
10/30/2001     10/30/2001
9/2/2000       10/30/2001

Consider the following with regard to the functionality:
1. In order for the function to assign a value, and ultimately store it in the repository, the port must be connected to a downstream object. It need not go to the target, but it must go to another Expression transformation. The reason is that the memory will not be instantiated unless it is used in a downstream transformation object.
2. In order for the function to work correctly, the rows have to be marked for insert. If the mapping is an update-only mapping (i.e., Treat Rows As is set to Update in the session properties), the function will not work. In this case, make the session Data Driven and add an Update Strategy after the transformation containing the SETMAXVARIABLE function, but before the target.
3. If the intent is to store the original Date_Entered per row and not the evaluated date value, then add an ORDER BY clause to the Source Qualifier. This way, the dates are processed and set in order and data is preserved.
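The running-max behavior shown in the table above can be emulated outside PowerCenter for testing. This is an illustrative Python sketch, not PowerCenter's engine; the function name merely mirrors SETMAXVARIABLE:

```python
from datetime import datetime

def set_max_variable(current_max, date_entered):
    """Emulate SETMAXVARIABLE: keep the larger of the stored and row values."""
    return max(current_max, date_entered)

rows = [datetime(2000, 9, 1), datetime(2001, 10, 30), datetime(2000, 9, 2)]
post_date = datetime(1900, 1, 1)   # initial value of $$Post_Date
for date_entered in rows:
    post_date = set_max_variable(post_date, date_entered)
    print(date_entered.date(), "->", post_date.date())

# After the run, post_date (2001-10-30) would be saved to the repository
# and used in the next run's source filter.
```

Running the three sample rows through the loop reproduces the Resultant POST_DATE column of the table.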

The first time this mapping is run, the SQL will select from the source where Date_Entered is > 01/01/1900, providing an initial load. As data flows through the mapping, the variable is updated to the maximum Date_Entered it encounters. Upon successful completion of the session, the variable is updated in the repository for use in the next session run. To view the current value for a particular variable associated with the session, right-click the session in Workflow Manager and choose View Persistent Values. The following graphic shows that after the initial run, the Max Date_Entered was 02/03/1998. The next time this session is run, based on the variable in the Source Qualifier filter, only sources where Date_Entered > 02/03/1998 will be processed.


Resetting or Overriding Persistent Values


To reset the persistent value to the initial value declared in the mapping, view the persistent value from Workflow Manager (see graphic above) and press Delete Values. This deletes the stored value from the repository, causing the Order of Evaluation to use the Initial Value declared in the mapping. If a session run is needed for a specific date, use a parameter file. There are two basic ways to accomplish this:
- Create a generic parameter file, place it on the server, and point all sessions to that parameter file. A session may (or may not) have a variable, and the parameter file need not have variables and parameters defined for every session using the parameter file. To override the variable, either change, uncomment, or delete the variable in the parameter file.
- Run pmcmd for that session, but declare the specific parameter file within the pmcmd command.

Configuring the Parameter File Location


Specify the parameter filename and directory in the workflow or session properties. To enter a parameter file in the workflow or session properties:
1. Select either the Workflow or Session, choose Edit, and click the Properties tab.
2. Enter the parameter directory and name in the Parameter Filename field.
3. Enter either a direct path or a server variable directory. Use the appropriate delimiter for the Integration Service operating system.

The following graphic shows the parameter filename and location specified in the session task.


The next graphic shows the parameter filename and location specified in the Workflow.

In this example, after the initial session is run, the parameter file contents may look like:

[Test.s_Incremental]
;$$Post_Date=

By using the semicolon, the variable override is ignored and the Initial Value or Stored Value is used. If, in the subsequent run, the data processing date needs to be set to a specific date (for example: 04/21/2001), then a simple Perl script or manual change can update the parameter file to:

[Test.s_Incremental]
$$Post_Date=04/21/2001

Upon running the session, the order of evaluation looks to the parameter file first, sees a valid variable and value, and uses that value for the session run. After successful completion, run another script to reset the parameter file.
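The "simple Perl script or manual change" mentioned above can equally be a few lines of Python. The helper below is an illustrative sketch based on the sample file contents; the function name is an assumption:

```python
def set_post_date(text, value=None):
    """Set $$Post_Date to a given value, or comment it out when value is None."""
    out = []
    for line in text.splitlines():
        if line.lstrip(";").startswith("$$Post_Date"):
            # Active line when a value is given; commented-out line otherwise.
            out.append(";$$Post_Date=" if value is None
                       else "$$Post_Date=%s" % value)
        else:
            out.append(line)
    return "\n".join(out) + "\n"

parm = "[Test.s_Incremental]\n;$$Post_Date=\n"
override = set_post_date(parm, "04/21/2001")  # force a specific processing date
reset = set_post_date(override)               # back to ';$$Post_Date=' after the run
```

Calling the helper once before the run and once after it gives exactly the override-then-reset cycle the text describes.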

Sample: Using Session and Mapping Parameters in Multiple Database Environments


Reusable mappings that can source a common table definition across multiple databases, regardless of differing environmental definitions (e.g., instances, schemas, user/logins), are required in a multiple database environment.

Scenario
Company X maintains five Oracle database instances. All instances have a common table definition for sales orders, but each instance has a unique instance name, schema, and login.

DB Instance   Schema     Table        User    Password
ORC1          aardso     orders       Sam     max
ORC99         environ    orders       Help    me
HALC          hitme      order_done   Hi      Lois
UGLY          snakepit   orders       Punch   Judy
GORF          gmer       orders       Brer    Rabbit

Each sales order table has a different name, but the same definition:

ORDER_ID        NUMBER (28)    NOT NULL,
DATE_ENTERED    DATE           NOT NULL,
DATE_PROMISED   DATE           NOT NULL,
DATE_SHIPPED    DATE           NOT NULL,
EMPLOYEE_ID     NUMBER (28)    NOT NULL,
CUSTOMER_ID     NUMBER (28)    NOT NULL,
SALES_TAX_RATE  NUMBER (5,4)   NOT NULL,
STORE_ID        NUMBER (28)    NOT NULL

Sample Solution
Using Workflow Manager, create multiple relational connections. In this example, the strings are named according to the DB Instance name. Using Designer, create the mapping that sources the commonly defined table. Then create a Mapping Parameter named $$Source_Schema_Table with the following attributes:


Note that the parameter attributes vary based on the specific environment. Also, the initial value is not required since this solution uses parameter files. Open the Source Qualifier and use the mapping parameter in the SQL Override as shown in the following graphic.

Open the Expression Editor and select Generate SQL. The generated SQL statement shows the columns.

Override the table names in the SQL statement with the mapping parameter. Using Workflow Manager, create a session based on this mapping. Within the Source Database connection dropdown box, choose the following parameter: $DBConnection_Source. Point the target to the corresponding target and finish. Now create the parameter files. In this example, there are five separate parameter files.

Parmfile1.txt
[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=aardso.orders
$DBConnection_Source=ORC1

Parmfile2.txt
[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=environ.orders
$DBConnection_Source=ORC99

Parmfile3.txt
[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=hitme.order_done
$DBConnection_Source=HALC

Parmfile4.txt
[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=snakepit.orders
$DBConnection_Source=UGLY

Parmfile5.txt
[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=gmer.orders
$DBConnection_Source=GORF

Use pmcmd to run the five sessions in parallel. The syntax for pmcmd for starting sessions with a particular parameter file is as follows:

pmcmd startworkflow -s serveraddress:portno -u Username -p Password -paramfile parmfilename s_Incremental

You may also use "-pv pwdvariable" if the named environment variable contains the encrypted form of the actual password.
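The five parallel invocations can be driven from a small launcher script. This is an illustrative sketch: the server address, user, and file names are the example's placeholders, the flags follow the sample syntax above, and pmcmd must be on the PATH for the launch step to do anything:

```python
import shutil
import subprocess

def build_pmcmd(parmfile, server="127.0.0.1:4001", user="tech_user", pwd="pwd"):
    """Assemble one pmcmd startworkflow invocation as an argument list."""
    return ["pmcmd", "startworkflow",
            "-u", user, "-p", pwd, "-s", server, "-f", "Test",
            "-paramfile", parmfile,
            "s_Incremental_SOURCE_CHANGES"]

commands = [build_pmcmd("Parmfile%d.txt" % i) for i in range(1, 6)]

if shutil.which("pmcmd"):  # only launch where pmcmd is actually installed
    procs = [subprocess.Popen(cmd) for cmd in commands]  # all five in parallel
    for proc in procs:
        proc.wait()
```

Using Popen for all five before waiting is what makes the runs parallel; replacing it with subprocess.run would serialize them, which matches the sequential alternative described below.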

Notes on Using Parameter Files with Startworkflow


When starting a workflow, you can optionally enter the directory and name of a parameter file. The PowerCenter Integration Service runs the workflow using the parameters in the file specified.

For UNIX shell users, enclose the parameter file name in single quotes:

-paramfile '$PMRootDir/myfile.txt'

For Windows command prompt users, the parameter file name cannot have beginning or trailing spaces. If the name includes spaces, enclose the file name in double quotes:

-paramfile "$PMRootDir\my file.txt"

Note: When writing a pmcmd command that includes a parameter file located on another machine, use the backslash (\) with the dollar sign ($). This ensures that the machine where the variable is defined expands the server variable.

pmcmd startworkflow -uv USERNAME -pv PASSWORD -s SALES:6258 -f east -w wSalesAvg -paramfile '\$PMRootDir/myfile.txt'

In the event that it is necessary to run the same workflow with different parameter files, use the following five separate commands:

pmcmd startworkflow -u tech_user -p pwd -s 127.0.0.1:4001 -f Test s_Incremental_SOURCE_CHANGES -paramfile \$PMRootDir\ParmFiles\Parmfile1.txt 1 1
pmcmd startworkflow -u tech_user -p pwd -s 127.0.0.1:4001 -f Test s_Incremental_SOURCE_CHANGES -paramfile \$PMRootDir\ParmFiles\Parmfile2.txt 1 1
pmcmd startworkflow -u tech_user -p pwd -s 127.0.0.1:4001 -f Test s_Incremental_SOURCE_CHANGES -paramfile \$PMRootDir\ParmFiles\Parmfile3.txt 1 1
pmcmd startworkflow -u tech_user -p pwd -s 127.0.0.1:4001 -f Test s_Incremental_SOURCE_CHANGES -paramfile \$PMRootDir\ParmFiles\Parmfile4.txt 1 1
pmcmd startworkflow -u tech_user -p pwd -s 127.0.0.1:4001 -f Test s_Incremental_SOURCE_CHANGES -paramfile \$PMRootDir\ParmFiles\Parmfile5.txt 1 1

Alternatively, run the sessions in sequence with one parameter file. In this case, a pre- or post-session script can change the parameter file for the next session.


Dynamically creating Parameter Files with a mapping


Using advanced techniques, a PowerCenter mapping can be built that produces as its target file a parameter file (.parm) that can be referenced by other mappings and sessions. When many mappings use the same parameter file, it is desirable to be able to easily re-create the file when mapping parameters are changed or updated. This is also beneficial when parameters change from run to run.

There are a few different methods of creating a parameter file with a mapping. A mapping template example on my.informatica.com illustrates a method of using a PowerCenter mapping to source from a process table containing mapping parameters and to create a parameter file. The same result can also be achieved by sourcing a flat file in parameter file format, with code characters in the fields to be altered:

[folder_name.session_name]
parameter_name=<parameter_code>
variable_name=value
mapplet_name.parameter_name=value

[folder2_name.session_name]
parameter_name=<parameter_code>
variable_name=value
mapplet_name.parameter_name=value

In place of the text <parameter_code>, one could place the text filename_<timestamp>.dat. The mapping would then perform a string replace wherever the text <timestamp> occurred, and the output might look like:

Src_File_Name=filename_20080622.dat

This method works well when values change often and parameter groupings utilize different parameter sets. The overall benefit is that if many mappings use the same parameter file, changes can be made by updating the source table and re-creating the file, which is faster than manually updating the file line by line.
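The string-replace step described above is simple to prototype outside PowerCenter. This sketch is illustrative only; the template line mirrors the Src_File_Name example:

```python
from datetime import date

def expand_template(template, run_date):
    """Replace the <timestamp> placeholder with a YYYYMMDD run date."""
    return template.replace("<timestamp>", run_date.strftime("%Y%m%d"))

template = "Src_File_Name=filename_<timestamp>.dat"
print(expand_template(template, date(2008, 6, 22)))
# -> Src_File_Name=filename_20080622.dat
```

Passing the run date in as an argument (rather than calling date.today() inside the function) keeps the expansion reproducible, which matters when a failed run must be re-generated for a prior date.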

Final Tips for Parameters and Parameter Files


- Use a single parameter file to group parameter information for related sessions. When sessions are likely to use the same database connection or directory, you might want to include them in the same parameter file. When connections or directories change, you can update information for all sessions by editing one parameter file.
- Sometimes you reuse session parameters in a cycle. For example, you might run a session against a sales database every day, but run the same session against sales and marketing databases once a week. You can create separate parameter files for each session run. Instead of changing the parameter file in the session properties each time you run the weekly session, use pmcmd to specify the parameter file to use when you start the session.


- Use reject file and session log parameters in conjunction with target file or target database connection parameters. When you use a target file or target database connection parameter with a session, you can keep track of reject files by using a reject file parameter. You can also use the session log parameter to write the session log to the target machine.
- Use a resource to verify the session runs on a node that has access to the parameter file. In the Administration Console, you can define a file resource for each node that has access to the parameter file and configure the Integration Service to check resources. Then, edit the session that uses the parameter file and assign the resource. When you run the workflow, the Integration Service runs the session with the required resource on a node that has the resource available.
- Save all parameter files in one of the process variable directories. If you keep all parameter files in one of the process variable directories, such as $SourceFileDir, use the process variable in the session property sheet. If you need to move the source and parameter files at a later date, you can update all sessions by changing the process variable to point to the new directory.

Last updated: 29-May-08 17:43


Using PowerCenter with UDB

Challenge


Universal Database (UDB) is a database platform that can be used to run PowerCenter repositories and act as source and target databases for PowerCenter mappings. Like any software, it has its own way of doing things. It is important to understand these behaviors so as to configure the environment correctly for implementing PowerCenter and other Informatica products with this database platform. This Best Practice offers a number of tips for using UDB with PowerCenter.

Description
UDB Overview
UDB is used for a variety of purposes and with various environments. UDB servers run on Windows, OS/2, AS/400 and UNIX-based systems like AIX, Solaris, and HP-UX. UDB supports two independent types of parallelism: symmetric multi-processing (SMP) and massively parallel processing (MPP). Enterprise-Extended Edition (EEE) is the most common UDB edition used in conjunction with the Informatica product suite. UDB EEE introduces a dimension of parallelism that can be scaled to very high performance. A UDB EEE database can be partitioned across multiple machines that are connected by a network or a high-speed switch. Additional machines can be added to an EEE system as application requirements grow. The individual machines participating in an EEE installation can be either uniprocessors or symmetric multiprocessors.

Connection Setup
You must set up a remote database connection to connect to DB2 UDB via PowerCenter. This is necessary because DB2 UDB sets a very small limit on the number of attachments per user to the shared memory segments when the user is using the local (or indirect) connection/protocol. The PowerCenter server runs into this limit when it is acting as the database agent or user. This is especially apparent when the repository is installed on DB2 and the target data source is on the same DB2 database. The local protocol limit will definitely be reached when using the same connection node for the repository via the PowerCenter Server and for the targets. This occurs when the session is executed and the server sends requests for multiple agents to be launched. Whenever the limit on the number of database agents is reached, the following error occurs:

CMN_1022 [[IBM][CLI Driver] SQL1224N A database agent could not be started to service a request, or was terminated as a result of a database system shutdown or a force command. SQLSTATE=55032]

The following recommendations may resolve this problem:
- Increase the number of connections permitted by DB2.
- Catalog the database as if it were remote. (For information on how to catalog a database with a remote node, refer to Knowledge Base article 14745 at the my.Informatica.com support Knowledge Base.)
- Be sure to close connections when programming exceptions occur.
- Verify that connections obtained in one method are returned to the pool via close() (the PowerCenter Server is very likely already doing this).
- Verify that your application does not try to access pre-empted connections (i.e., idle connections that are now used by other resources).

DB2 Timestamp
DB2 has a timestamp data type that is precise to the microsecond and uses a 26-character format, as follows:

YYYY-MM-DD-HH.MI.SS.MICROS

(where MICROS after the last period represents six decimal places of a second)

The PowerCenter Date/Time datatype only supports precision to the second (using a 19-character format), so under normal circumstances when a timestamp source is read into PowerCenter, the six decimal places after the second are lost. This is sufficient for most data warehousing applications but can cause significant problems where this timestamp is used as part of a key. If the MICROS need to be retained, this can be accomplished by changing the format of the column from a timestamp data type to a character 26 in the source and target definitions. When the timestamp is read from DB2, it will be read in and converted to character in the YYYY-MM-DD-HH.MI.SS.MICROS format. Likewise, when writing to a timestamp, pass the date as a character in the YYYY-MM-DD-HH.MI.SS.MICROS format. If this format is not retained, the records are likely to be rejected due to an invalid date format error.

It is also possible to maintain the timestamp correctly using the timestamp data type itself. Setting a flag at the PowerCenter Server level does this; the technique is described in Knowledge Base article 10220 at my.Informatica.com.
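The precision loss described above is easy to demonstrate. The following sketch parses the 26-character DB2 format with Python's datetime; mapping MICROS to the %f directive is an assumption made for illustration, not an Informatica or IBM API:

```python
from datetime import datetime

DB2_FORMAT = "%Y-%m-%d-%H.%M.%S.%f"  # YYYY-MM-DD-HH.MI.SS.MICROS

def parse_db2_timestamp(text):
    """Parse a 26-character DB2 timestamp string into a datetime."""
    return datetime.strptime(text, DB2_FORMAT)

def format_db2_timestamp(ts):
    """Render a datetime back into the 26-character DB2 format."""
    return ts.strftime(DB2_FORMAT)

ts = parse_db2_timestamp("2008-05-29-17.43.00.123456")
print(format_db2_timestamp(ts))   # full 26-character round trip
print(ts.replace(microsecond=0))  # what a seconds-only datatype would keep
```

Carrying the value as a 26-character string, as the text recommends, is what avoids the truncation shown on the last line.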

Importing Sources or Targets


If the value of the DB2 system variable APPLHEAPSZ is too small when you use the Designer to import sources/targets from a DB2 database, the Designer reports an error accessing the repository. The Designer status bar displays the following message:

SQL Error:[IBM][CLI Driver][DB2]SQL0954C: Not enough storage is available in the application heap to process the statement.

If you receive this error, increase the value of the APPLHEAPSZ variable for your DB2 operating system. APPLHEAPSZ is the application heap size (in 4KB pages) for each process using the database.

Unsupported Datatypes
PowerMart and PowerCenter do not support the following DB2 datatypes:
- Dbclob
- Blob
- Clob
- Real

DB2 External Loaders


The DB2 EE and DB2 EEE external loaders can both perform insert and replace operations on targets. Both can also restart or terminate load operations.
- The DB2 EE external loader invokes the db2load executable located in the PowerCenter Server installation directory. The DB2 EE external loader can load data to a DB2 server on a machine that is remote to the PowerCenter Server.
- The DB2 EEE external loader invokes the IBM DB2 Autoloader program to load data. The Autoloader program uses the db2atld executable. The DB2 EEE external loader can partition data and load the partitioned data simultaneously to the corresponding database partitions. When you use the DB2 EEE external loader, the PowerCenter Server and the DB2 EEE server must be on the same machine.

The DB2 external loaders load from a delimited flat file. Be sure that the target table columns are wide enough to store all of the data. If you configure multiple targets in the same pipeline to use DB2 external loaders, each loader must load to a different tablespace on the target database. For information on selecting external loaders, see "Configuring External Loading in a Session" in the PowerCenter User Guide.

Setting DB2 External Loader Operation Modes


DB2 operation modes specify the type of load the external loader runs. You can configure the DB2 EE or DB2 EEE external loader to run in any one of the following operation modes:
- Insert. Adds loaded data to the table without changing existing table data.
- Replace. Deletes all existing data from the table, and inserts the loaded data. The table and index definitions do not change.
- Restart. Restarts a previously interrupted load operation.
- Terminate. Terminates a previously interrupted load operation and rolls back the operation to the starting point, even if consistency points were passed. The tablespaces return to normal state, and all table objects are made consistent.

Configuring Authorities, Privileges, and Permissions


When you load data to a DB2 database using either the DB2 EE or DB2 EEE external loader, you must have the correct authority levels and privileges to load data into to the database tables. DB2 privileges allow you to create or access database resources. Authority levels provide a method of grouping privileges and higher-level database manager maintenance and utility operations. Together, these functions control access to the database manager and its database objects. You can access only those objects for which you have the required privilege or authority. To load data into a table, you must have one of the following authorities:

- SYSADM authority
- DBADM authority
- LOAD authority on the database, with INSERT privilege

In addition, you must have proper read access and read/write permissions:
- The database instance owner must have read access to the external loader input files.
- If you run DB2 as a service on Windows, you must configure the service start account with a user account that has read/write permissions to use LAN resources, including drives, directories, and files.
- If you load to DB2 EEE, the database instance owner must have write access to the load dump file and the load temporary file.

Remember, the target file must be delimited when using the DB2 AutoLoader.

Guidelines for Performance Tuning


You can achieve numerous performance improvements by properly configuring the database manager, database, and tablespace container and parameter settings. For example, MAXFILOP is one of the database configuration parameters that you can tune. The default value for MAXFILOP is far too small for most databases. When this value is too small, UDB spends a lot of extra CPU processing time closing and opening files. To resolve this problem, increase the MAXFILOP value until UDB stops closing files.

You must also have enough DB2 agents available to process the workload based on the number of users accessing the database. Incrementally increase the value of MAXAGENTS until agents are not stolen from another application.

Moreover, sufficient memory allocated to the CATALOGCACHE_SZ database configuration parameter also benefits the database. If the value of the catalog cache heap is greater than zero, both DBHEAP and CATALOGCACHE_SZ should be proportionally increased.

In UDB, the LOCKTIMEOUT default value is -1 (wait indefinitely). In a data warehouse database, set this value to 60 seconds.

Remember to define TEMPSPACE tablespaces so that they have at least 3 or 4 containers across different disks, and set the PREFETCHSIZE to a multiple of EXTENTSIZE, where the multiplier is equal to the number of containers. Doing so will enable parallel I/O for larger sorts, joins, and other database functions requiring substantial TEMPSPACE space.


In UDB, a LOGBUFSZ value of 8 is too small; try setting it to 128. Also, set INTRA_PARALLEL to YES for CPU parallelism. The database configuration parameter DFT_DEGREE should be set to a value between ANY and 1, depending on the number of CPUs available and the number of processes that will be running simultaneously. Setting DFT_DEGREE to ANY can monopolize the CPUs, since one process can take up all the processing power with this setting, while setting it to 1 provides no parallelism at all.

Note: DFT_DEGREE and INTRA_PARALLEL are applicable only for an EEE database.

Data warehouse databases perform numerous sorts, many of which can be very large. SORTHEAP memory is also used for hash joins, which a surprising number of DB2 users fail to enable. To do so, use the db2set command to set the environment variable DB2_HASH_JOIN=ON. For a data warehouse database, at a minimum, double or triple the SHEAPTHRES (to between 40,000 and 60,000) and set the SORTHEAP size between 4,096 and 8,192. If real memory is available, some clients use even larger values for these configuration parameters.

SQL is very complex in a data warehouse environment and often consumes large quantities of CPU and I/O resources. Therefore, set DFT_QUERYOPT to 7 or 9.

UDB uses NUM_IO_CLEANERS for writing to TEMPSPACE, temporary intermediate tables, index creations, and more. Set NUM_IO_CLEANERS equal to the number of CPUs on the UDB server and focus on your disk layout strategy instead.

Lastly, for RAID devices where several disks appear as one to the operating system, be sure to do the following:
1. db2set DB2_STRIPED_CONTAINERS=YES (do this before creating tablespaces or before a redirected restore)
2. db2set DB2_PARALLEL_IO=* (or use TablespaceID numbers for tablespaces residing on the RAID devices, for example DB2_PARALLEL_IO=4,5,6,7,8,10,12,13)
3. Alter the tablespace PREFETCHSIZE for each tablespace residing on RAID devices such that the PREFETCHSIZE is a multiple of the EXTENTSIZE.

Database Locks and Performance Problems


When working in an environment with many users targeting a DB2 UDB database, you may experience slow and erratic behavior resulting from the way UDB handles database locks. Out of the box, DB2 UDB database and client connections are configured on the assumption that they will be part of an OLTP system, and they place several locks on records and tables. Because PowerCenter typically works with OLAP systems, where it is the only process writing to the database and users are primarily reading from it, this default locking behavior can have a significant impact on performance.

Connections to DB2 UDB databases are set up using the DB2 Client Configuration utility. To minimize problems with the default settings, make the following changes to all remote clients accessing the database for read-only purposes. To help replicate these settings, you can export the settings from one client and then import the resulting file into all the other clients.

- Enable Cursor Hold is the default setting for the Cursor Hold option. Edit the configuration settings and make sure the Enable Cursor Hold option is not checked.
- Connection Mode should be Shared, not Exclusive.
- Isolation Level should be Read Uncommitted (the minimum level) or Read Committed (if updates by other applications are possible and dirty reads must be avoided).

To set the isolation level to dirty read at the PowerCenter Server level, you can set a flag in the PowerCenter configuration file. For details on this process, refer to KB article 13575 in the my.Informatica.com support knowledgebase.

If you are not sure how to adjust these settings, launch the IBM DB2 Client Configuration utility, highlight the database connection you use, and select Properties. In Properties, select Settings and then Advanced. You will see these options and their settings on the Transaction tab.

To export the settings from the main screen of the IBM DB2 Client Configuration utility, highlight the database connection you use, then select Export and All. Use the same process to import the settings on another client.

If users run hand-coded queries against the target table using DB2's Command Center, be sure they know to use script mode and avoid interactive mode (by choosing the Script tab instead of the Interactive tab when writing queries). Interactive mode can lock returned records, while script mode merely returns the result and does not hold them.


If your target DB2 table is partitioned across different nodes in DB2, you can set the target partition type to Database Partitioning in the PowerCenter session properties. When database partitioning is selected, separate connections are opened directly to each node and the load starts in parallel, which improves performance and scalability.

Last updated: 13-Feb-07 17:14


Using Shortcut Keys in PowerCenter Designer

Challenge


Using shortcuts and work-arounds to work as efficiently as possible in PowerCenter Mapping Designer and Workflow Manager.

Description
After you are familiar with the normal operation of PowerCenter Mapping Designer and Workflow Manager, you can use a variety of shortcuts to speed up their operation. PowerCenter provides two types of shortcuts:

- keyboard shortcuts to edit repository objects and maneuver through the Mapping Designer and Workflow Manager as efficiently as possible, and
- shortcuts that simplify the maintenance of repository objects.

General Suggestions

Maneuvering the Navigator Window


Follow these steps to open a folder with its workspace open as well:

1. While highlighting the folder, click the Open Folder icon. Note: Double-clicking the folder name only opens the folder if it has not yet been opened or connected to.
2. Alternatively, right-click the folder name, then click Open.

Working with the Toolbar and Menubar


The toolbar contains commonly used features and functions within the various client tools. Using the toolbar is often faster than selecting commands from within the menubar.

- To add more toolbars, select Tools | Customize.
- Select the Toolbar tab to add or remove toolbars.

Follow these steps to use drop-down menus without the mouse:

1. Press and hold the <Alt> key. You will see an underline under one letter of each of the menu titles.
2. Press the underlined letter for the desired drop-down menu. For example, press 'r' for the 'Repository' drop-down menu.


3. Press the underlined letter to select the command/operation you want. For example, press 't' for 'Close All Tools'.
4. Alternatively, after you have pressed the <Alt> key, use the right/left arrows to navigate across the menubar and the up/down arrows to expand and navigate through the drop-down menu. Press Enter when the desired command is highlighted.

- To create a customized toolbar for the functions you frequently use, press <Alt><T> (expands the Tools drop-down menu), then <C> (for Customize).
- To delete customized icons, select Tools | Customize, and then remove the icons by dragging them directly off the toolbar.
- To add an icon to an existing (or new) toolbar, select Tools | Customize and navigate to the Commands tab. Find the desired command, then drag and drop its icon onto your toolbar.
- To rearrange the toolbars, click and drag the toolbar to the new location. You can insert more than one toolbar at the top of the Designer tool to avoid having the buttons go off the edge of the screen. Alternatively, you can position the toolbars at the bottom, side, or between the workspace and the message windows.
- To dock or undock a window (e.g., the Repository Navigator), double-click on the window's title bar. If you are having trouble docking the window again, right-click somewhere in the white space of the runaway window (not the title bar) and make sure that the "Allow Docking" option is checked. When it is checked, drag the window to its proper place and release it when an outline of where the window used to be appears.

Keyboard Shortcuts
Use the following keyboard shortcuts to perform various operations in Mapping Designer and Workflow Manager.

To:                                                                      Press:
Cancel editing in an object                                              Esc
Check and uncheck a check box                                            Space Bar
Copy text from an object onto the clipboard                              Ctrl+C
Cut text from an object onto the clipboard                               Ctrl+X
Edit the text of an object                                               F2, then move the cursor to the desired location
Find all combination and list boxes                                      Type the first letter of the list
Find tables or fields in the workspace                                   Ctrl+F
Move around objects in a dialog box (pans workspace when none selected)  Ctrl+directional arrows
Paste copied or cut text from the clipboard into an object               Ctrl+V
Select the text of an object                                             F2
Start help                                                               F1

Mapping Designer

Navigating the Workspace


When using the drag-and-drop approach to create foreign key/primary key relationships between tables, be sure to start in the foreign key table and drag the key/field to the primary key table. Set the Key Type value to "NOT A KEY" prior to dragging.

Follow these steps to quickly select multiple transformations:

1. Hold the mouse button down and drag to view a box.
2. Be sure the box touches every object you want to select. The selected items will have a distinctive outline around them.
3. If you miss one or select an extra, hold down the <Shift> or <Ctrl> key and click the relevant transformations one at a time. They alternate between being selected and deselected each time you click them.

Follow these steps to copy and link fields between transformations:

1. You can select multiple ports when you are trying to link to the next transformation.
2. When you link multiple ports, they are linked in the same order as in the source transformation. Highlight the fields you want in the source transformation and hold the mouse button over the port name in the target transformation that corresponds to the source transformation port.
3. Use the Autolink function whenever possible. It is located under the Layout menu (or accessible by right-clicking in the workspace) of the Mapping Designer.
4. Autolink can link by name or position. PowerCenter version 6 and later gives you the option of entering prefixes or suffixes (when you click the 'More' button). This is especially helpful when autolinking from a Router transformation to some target transformation. For example, each group created in a Router has a distinct suffix number added to the port/field name. To autolink, choose the proper Router and Router group in the 'From Transformation' space, then click the 'More' button and enter the appropriate suffix value. You must do both to create a link.
5. Autolink does not work if any of the fields in the 'To' transformation are already linked to another group or another stream. No error appears; the links are simply not created.

Sometimes a shared object is very close to (but not exactly) what you need. In this case, you may want to make a copy of the object with some minor alterations to suit your purposes. If you simply click and drag the object, it will either ask you if you want to make a shortcut or it will be reusable every time. Follow these steps to make a non-reusable copy of a reusable object:

1. Open the target folder.
2. Select the object that you want to copy, either in the source or the target folder.
3. Drag the object over the workspace.
4. Press and hold the <Ctrl> key (a crosshairs symbol '+' appears in a white box).
5. Release the mouse button, then release the <Ctrl> key.
6. A copy confirmation window and a copy wizard window appear.
7. The newly created transformation no longer says that it is reusable, and you are free to make changes without affecting the original reusable object.

Editing Tables/Transformations
Follow these steps to move one port in a transformation:

1. Double-click the transformation and make sure you are in the Ports tab. (You go directly to the Ports tab if you double-click a port instead of the colored title bar.)
2. Highlight the port and click the up/down arrow button to reposition the port.
3. Or, highlight the port and then press <Alt><w> to move the port down or <Alt><u> to move it up. Note: You can hold down <Alt> and press <w> or <u> multiple times to move the currently highlighted port down or up repeatedly.

Alternatively, you can accomplish the same thing by following these steps:

1. Highlight the port you want to move by clicking the number beside the port.
2. Grab the port by its number and continue holding down the left mouse button.
3. Drag the port to the desired location (the list of ports scrolls when you reach the end). A red line indicates the new location.
4. When the red line points to the desired location, release the mouse button. Note: You cannot move more than one port at a time with this method. See below for instructions on moving more than one port at a time.

If you are using PowerCenter version 6.x, 7.x, or 8.x and the ports you are moving are adjacent, you can follow these steps to move more than one port at a time:

1. Highlight the ports you want to move by clicking the number beside each port while holding down the <Ctrl> key.
2. Use the up/down arrow buttons to move the ports to the desired location.
- To add a new field or port, first highlight an existing field or port, then press <Alt><f> to insert the new field/port below it.
- To validate a defined default value, first highlight the port you want to validate, and then press <Alt><v>. A message box confirms the validity of the default value.
- After creating a new port, simply begin typing the name you want to give the port. There is no need to remove the default "NEWFIELD" text before labelling the new port. The same method applies when modifying existing port names: highlight the existing port by clicking on the port number and begin typing the modified name.
- To prefix a port name, press <Home> to bring the cursor to the beginning of the port name. To add a suffix to a port name, press <End> to bring the cursor to the end of the port name.
- Check boxes can be checked or unchecked by highlighting the desired check box and pressing the space bar to toggle the check mark on and off.

Follow either of these steps to quickly open the Expression Editor of an output or variable port:

1. Highlight the expression so that there is a box around the cell and press <F2> followed by <F3>.
2. Or, highlight the expression so that there is a cursor somewhere in the expression, then press <F2>.
- To cancel an edit in the grid, press <Esc> so the changes are not saved.
- For all combo/drop-down list boxes, type the first letter on the list to select the item you want. For example, you can highlight a port's Data type box without displaying the drop-down; to change it to 'binary', type <b>, then use the arrow keys to go down to the next port. This is very handy if you want to change all fields to string, for example, because using the up and down arrows and pressing a letter is much faster than opening the drop-down menu and making a choice each time.
- To copy a selected item in the grid, press <Ctrl><c>.
- To paste a selected item from the clipboard to the grid, press <Ctrl><v>.
- To delete a selected field or port from the grid, press <Alt><c>.
- To copy a selected row from the grid, press <Alt><o>.
- To paste a selected row from the grid, press <Alt><p>.

You can use either of the following methods to delete more than one port at a time:

- Repeatedly click the Cut button; or
- Highlight several rows and then click the Cut button. Use <Shift> to highlight many items in a row or <Ctrl> to highlight multiple non-contiguous items. Be sure to click the number beside the port, not the port name, while holding <Shift> or <Ctrl>.

Editing Expressions
Follow either of these steps to expedite validation of a newly created expression:

- Click the <Validate> button or press <Alt><v>. Note: This validates the expression and leaves the Expression Editor open.
- Or, press <OK> to initiate parsing/validating of the expression. The system closes the Expression Editor if the validation is successful. If you click OK once again in the "Expression parsed successfully" pop-up, the Expression Editor remains open.

There is little need to type in the Expression Editor. The tabs list all functions, ports, and variables that are currently available. If you want an item to appear in the Formula box, just double-click it in the appropriate list on the left. This helps to avoid typographical errors and mistakes (such as including an output-only port name in an expression formula).

In version 6.x and later, if you change a port name, PowerCenter automatically updates any expression that uses that port with the new name.

Be careful about changing data types. Any expression using the port with the new data type may remain valid but not perform as expected. If the change invalidates the expression, it will be detected when the object is saved or when the Expression Editor is active for that expression.

The following table summarizes additional shortcut keys that are applicable only when working with Mapping Designer:

To:                                                   Press:
Add a new field or port                               Alt+F
Copy a row                                            Alt+O
Cut a row                                             Alt+C
Move current row down                                 Alt+W
Move current row up                                   Alt+U
Paste a row                                           Alt+P
Validate the default value in a transformation        Alt+V
Open the Expression Editor from the expression field  F2, then F3
Start the debugger                                    F9

Repository Object Shortcuts


A repository object defined in a shared folder can be reused across folders by creating a shortcut (i.e., a dynamic link to the referenced object). Whenever possible, reuse source definitions, target definitions, reusable transformations, mapplets, and mappings. Reusing objects allows complex mappings, mapplets, or reusable transformations to be shared across folders, saves space in the repository, and reduces maintenance.

Follow these steps to create a repository object shortcut:

1. Expand the shared folder.
2. Click and drag the object definition into the mapping that is open in the workspace.
3. As the cursor enters the workspace, the object icon appears along with a small curve, indicating that a shortcut will be created.
4. A dialog box appears to confirm that you want to create a shortcut.

If you want to copy an object from a shared folder instead of creating a shortcut, hold down the <Ctrl> key before dropping the object into the workspace.

Workflow Manager

Navigating the Workspace


When editing a repository object or maneuvering around the Workflow Manager, use the following shortcuts to speed up the operation you are performing:

To:                                          Press:
Create links                                 Ctrl+F2 to select the first task to link; Tab to select the rest of the tasks; Ctrl+F2 again to link all selected tasks
Edit a task's name in the workspace          F2
Expand a selected node and all its children  Shift+* (asterisk on the numeric keypad)
Move across tasks in the workspace           Tab
Select multiple tasks                        Ctrl+mouse click

Repository Object Shortcuts


Mappings that reside in a shared folder can be reused within workflows by creating shortcut mappings. A set of workflow logic can be reused within workflows by creating a reusable worklet.

Last updated: 13-Feb-07 17:25


Working with the Java Transformation Object

Challenge


Occasionally, special processing of data is required that is not easy to accomplish using the existing PowerCenter transformation objects. Tasks such as looping through data an arbitrary number of times are not native to the existing transformation objects. For these situations, the Java Transformation provides the ability to develop Java code, opening up nearly unlimited transformation capabilities. This Best Practice addresses questions that are commonly raised about using the JTX and how to use it effectively, and supplements the existing PowerCenter documentation on the JTX.

Description
The Java Transformation (JTX), introduced in PowerCenter 8.0, provides a uniform means of entering and maintaining program code written in Java that is executed for every record processed during a session run. The Java code is maintained, entered, and viewed within the PowerCenter Designer. Below is a summary of typical questions about the JTX.

Is a JTX a passive or an active transformation?


A JTX can be either passive or active. When defining a JTX you must choose one or the other type. Once you make this choice you will not be able to change it without deleting the JTX, saving the repository and recreating the object. Hint: If you are working with a versioned repository, you will have to purge the deleted JTX from the repository before you can recreate it with the same name.

What parts of a typical Java class can be used in a JTX?


The following standard features can be used in a JTX:
- static initialization blocks can be defined on the tab Helper Code.
- import statements can be listed on the tab Import Packages.


- static variables of the Java class as a whole (e.g., counters for instances of this class), as well as non-static member variables (one per instance), can be defined on the tab Helper Code.
- Auxiliary member functions (static and dynamic) may be declared and defined on the tab Helper Code.
- static final variables may be defined on the tab Helper Code. However, they are private by nature; no object of any other Java class will be able to utilize them.

Important Note: Before starting a session that utilizes additional import clauses in the Java code, make sure that the environment variable CLASSPATH contains the necessary .jar files or directories before the PowerCenter Integration Service is started.

All non-static member variables declared on the tab Helper Code are automatically available to every partition of a partitioned session without any precautions. In other words, one object of the Java class generated by PowerCenter is instantiated for every instance of the JTX and for every session partition. For example, if you utilize two instances of the same reusable JTX and have set the session to run with three partitions, then six individual objects of that Java class will be instantiated for this session run.

What parts of a typical Java class cannot be utilized in a JTX?


The following standard features of Java are not available in a JTX:
- Standard and user-defined constructors
- Standard and user-defined destructors
- Any kind of direct user interface, be it a Swing GUI or a console-based user interface

What else cannot be done in a JTX?


One important note: you cannot retrieve, change, or utilize an existing database connection in a JTX (such as a source connection, a target connection, or a relational connection of a Lookup). If you need to establish a database connection, use JDBC within the JTX, and make sure that you provide the necessary connection parameters by other means.
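As a rough illustration of the JDBC route, the sketch below builds a DB2 JDBC URL and shows where a connection would be opened. All names and values here (class name, host, port, database, credentials) are placeholders invented for this example; the jdbc:db2:// URL form assumes the IBM DB2 JDBC (type 4) driver is on the CLASSPATH. In a real JTX, the open() call would typically run once, guarded by a boolean flag, rather than per input row.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class Db2JdbcSketch {
    // Builds a DB2 type-4 JDBC URL of the form jdbc:db2://host:port/database.
    static String buildDb2Url(String host, int port, String database) {
        return "jdbc:db2://" + host + ":" + port + "/" + database;
    }

    // Opens a connection; parameters would come from input ports or a
    // parameter file, since a JTX cannot reuse PowerCenter's own connections.
    static Connection open(String host, int port, String db,
                           String user, String password) throws SQLException {
        return DriverManager.getConnection(buildDb2Url(host, port, db), user, password);
    }

    public static void main(String[] args) {
        // Placeholder values for illustration only.
        System.out.println(buildDb2Url("dwhost", 50000, "WHSE"));
    }
}
```

Remember to close the connection in the On End Of Data code so the session does not leak database resources.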

How can I substitute constructors and the like in a JTX?


User-defined constructors are mainly used to pass certain initialization values to a Java class, values that you want to process only once. The only way to get this work done in a JTX is to pass those parameters into the JTX as normal ports and then define a boolean variable (with an initial value of true), for example named constructMissing, on the Helper Code tab. The very first block in the On Input Row code will then look like this:

if (constructMissing) {
    // do whatever you would do in the constructor
    constructMissing = false;
}

Interaction with users is mainly done to provide input values to some member functions of a class. This is usually not appropriate in a JTX because all input values should be provided by means of input records. If there is a need to enable immediate interaction with a user for one, several, or all input records, use an inter-process communication (IPC) mechanism to establish communication between the Java class associated with the JTX and an environment available to a user. For example, if the actual check to be performed can only be determined at runtime, you might establish JavaBeans communication between the JTX and the classes performing the actual checks. Beware, however, that this sort of mechanism causes great overhead and may decrease performance dramatically. In many cases, such requirements indicate that the analysis and mapping design processes have not been executed optimally.

How do I choose between an active and a passive JTX?


Use the following guidelines to identify whether you need an active or a passive JTX in your mapping:


- As a general rule of thumb, a passive JTX usually executes faster than an active JTX.
- If one input record equals one output record of the JTX, you will probably want to use a passive JTX.
- If you have to produce a varying number of output records per input record (i.e., for some input values the JTX generates one output record, for some no output records, and for some two or more output records), you have to utilize an active JTX. There is no other choice.
- If you have to accumulate one or more input records before generating one or more output records, you have to utilize an active JTX. There is no other choice.
- If you have to do some initialization work before processing the first input record, this in no way determines whether to utilize an active or a passive JTX.
- If you have to do some cleanup work after processing the last input record, this likewise in no way determines whether to utilize an active or a passive JTX.
- If you have to generate one or more output records after the last input record has been processed, you have to use an active JTX. There is no other choice, except changing the mapping to produce these additional records by other means.
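The distinction above can be pictured outside PowerCenter: passive behavior is a function returning exactly one value per input, while active behavior may return zero, one, or many rows per input. The plain-Java sketch below is illustrative only; the class and method names are invented, and a real active JTX would emit rows via its framework rather than fill a list as this standalone version does.

```java
import java.util.ArrayList;
import java.util.List;

public class ActiveVsPassiveSketch {
    // Passive behavior: exactly one output value per input value.
    static String passive(String in) {
        return in.trim().toUpperCase();
    }

    // Active behavior: zero or more output rows per input. A comma-separated
    // string is split into individual rows; an empty input yields no rows at
    // all, which a passive JTX could not express.
    static List<String> active(String in) {
        List<String> rows = new ArrayList<>();
        if (in != null && !in.isEmpty()) {
            for (String part : in.split(",")) {
                rows.add(part.trim());
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(passive("  abc "));         // exactly one row out
        System.out.println(active("a, b, c").size());  // three rows out
        System.out.println(active("").size());         // no rows out
    }
}
```

If your logic ever resembles active() here, choose the active type when creating the JTX, since the choice cannot be changed afterwards.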

How do I set up a JTX and use it in a mapping?


As with most standard transformations, you can either define a reusable JTX or an instance directly within a mapping. The following example describes how to define a JTX in a mapping. For this example, assume that the JTX has one input port of data type String and three output ports of type String, Integer, and Smallint.

Note: As of version 8.1.1, the PowerCenter Designer is extremely sensitive regarding the port structure of a JTX; make sure you read and understand the Notes section below before designing your first JTX, otherwise you will encounter issues when trying to run a session associated with your mapping.

1. Click the button showing the Java icon, then click on the background in the main window of the Mapping Designer. Choose whether to generate a passive or an active JTX (see How do I choose between an active and a passive JTX above). Remember, you cannot change this setting later.
2. Rename the JTX accordingly (e.g., rename it to JTX_SplitString).


3. Go to the Ports tab; define all input-only ports in the Input Group, and define all output-only and input-output ports in the Output Group. Make sure that every output-only and every input-output port is defined correctly.
4. Make sure you define the port structure correctly from the outset, as changing data types of ports after the JTX has been saved to the repository does not always work.
5. Click Apply.
6. On the Properties tab you may want to change certain properties. For example, the setting "Is Partitionable" is mandatory if the session will be partitioned. Follow the hints in the lower part of the screen form that explain the selection lists in detail.
7. Activate the Java Code tab. Enter code pieces where necessary. Be aware that all ports marked as input-output ports on the Ports tab are automatically processed as pass-through ports by the Integration Service. You do not have to (and should not) enter any code referring to pass-through ports. See the Notes section below for more details.
8. Click the Compile link near the lower-right corner of the screen form to compile the Java code you have entered. Check the output window at the lower border of the screen form for all compilation errors and work through each error message encountered; then click Compile again. Repeat this step as often as necessary until the Java code compiles without any error messages.
9. Click OK.
10. Connect only ports of the same data type to every input-only or input-output port of the JTX. Connect output-only and input-output ports of the JTX only to ports of the same data type in downstream transformations. If any downstream transformation expects a different data type than that of the respective output port of the JTX, insert an EXP to convert data types. Refer to the Notes below for more detail.
11. Save the mapping.

Notes:
- The primitive Java data types available in a JTX that can be used for ports connecting to other transformations are Integer, Double, and Date/Time. Date/time values are delivered to or by a JTX by means of a Java long value indicating the difference between the respective date/time value and midnight, Jan 1st, 1970 (the so-called Epoch) in milliseconds; to interpret this value, utilize the appropriate methods of the Java class GregorianCalendar. Smallint values cannot be delivered to or by a JTX.
- The Java object data types available in a JTX that can be used for ports are String, byte arrays (for Binary ports), and BigDecimal (for Decimal values of arbitrary precision).
- In a JTX you check whether an input port has a NULL value by calling the function isNull("name_of_input_port"). If an input value is NULL, you should explicitly set all depending output ports to NULL by calling setNull("name_of_output_port"). Both functions take the name of the respective input/output port as a string.
- You retrieve the value of an input port (provided this port is not NULL, see the previous point) simply by referring to the name of this port in your Java source code. For example, if you have two input ports i_1 and i_2 of type Integer and one output port o_1 of type String, you might set the output value with a statement like this one:

  o_1 = "First value = " + i_1 + ", second value = " + i_2;

- In contrast to a Custom Transformation, it is not possible to retrieve the names, data types, and/or values of pass-through ports unless these pass-through ports have been defined on the Ports tab in advance. In other words, it is impossible for a JTX to adapt to its port structure at runtime (which would be necessary, for example, for something like a Sorter JTX).
- If you have to transfer 64-bit values into a JTX, deliver them as a string representing the 64-bit number and convert this string into a Java long variable using the static method Long.parseLong(). Likewise, to deliver a 64-bit integer from a JTX to downstream transformations, convert the long variable to a string that will be an output port of the JTX (e.g., using the statement o_Int64 = "" + myLongVariable ).
- As of version 8.1.1, the PowerCenter Designer is very sensitive regarding the data types of ports connected to a JTX. Supplying a JTX with data types other than exactly the expected ones, or connecting output ports to transformations expecting other data types (e.g., a string instead of an integer), may cause the Designer to invalidate the mapping such that the only remedy is to delete the JTX, save the mapping, and re-create the JTX.
- Initialization Properties and Metadata Extensions can neither be defined nor retrieved in a JTX.
- The code entered on the Java Code sub-tab On Input Row is inserted into other code; only this complete code constitutes the method execute() of the resulting Java class associated with the JTX (see the output of the "View Code" link near the lower-right corner of the Java Code screen form). The same holds true for the code entered on the tabs On End Of Data and On Receiving Transactions with regard to their methods. This fact has a couple of implications, which are explained in more detail below.
- If you connect input and/or output ports to transformations with differing data types, you might get error messages during mapping validation. One error message that occurs quite often indicates that the byte code of the class cannot be retrieved from the repository. In this case, rectify the port connections to all input and/or output ports of the JTX, then edit the Java code (inserting one blank comment line usually suffices) and recompile it.
- The JTX does not currently allow true pass-through ports, so they have to be simulated by splitting each one into an input port and an output port and assigning the value of every input port to the respective output port. The key here is that the input port of each pair has to be in the Input Group while the respective output port has to be in the Output Group. If you do not do this, there is no warning in the Designer, but the JTX will not function correctly.
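The date/time and 64-bit conventions above can be tried outside PowerCenter. The sketch below shows how an epoch-milliseconds value, as delivered to a JTX date/time port, can be interpreted with GregorianCalendar, and how a 64-bit value survives the recommended string round trip; the class and variable names are hypothetical, and UTC is chosen here only to keep the result independent of the server time zone.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

public class JtxValueConversions {
    // Interprets an epoch-milliseconds value (difference from midnight,
    // Jan 1st, 1970) using GregorianCalendar.
    static int yearOfEpochMillis(long millis) {
        GregorianCalendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        cal.setTimeInMillis(millis);
        return cal.get(Calendar.YEAR);
    }

    public static void main(String[] args) {
        // 0 ms after the Epoch is midnight, Jan 1st, 1970.
        System.out.println(yearOfEpochMillis(0L));      // 1970

        // 64-bit round trip via a String port, as recommended above; this
        // value exceeds double precision, so a String port preserves it.
        String i_Int64 = "9007199254740993";
        long value = Long.parseLong(i_Int64);  // convert the incoming string port
        String o_Int64 = "" + value;           // convert back for an output port
        System.out.println(i_Int64.equals(o_Int64));    // true
    }
}
```

The same pattern applies in reverse: build the outgoing date/time long with cal.getTimeInMillis() after setting the calendar fields.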

Where and how to insert what pieces of Java code into a JTX?
A JTX always contains a code skeleton that is generated by the Designer. Every piece of code written by a mapping designer is inserted into this skeleton at designated places. Because all these code pieces do not constitute the sole content of the respective functions, there are certain rules and recommendations as to how to write such code. As mentioned previously, a mapping designer can neither write his or her own constructor nor insert any code into the default constructor or the default destructor generated by the Designer. All initialization work can be done in either of the following two ways:
- as part of the static{} initialization block, or
- by inserting code that in a standalone class would be part of the constructor into the tab On Input Row.

Similarly, code that in a standalone class would be part of the destructor belongs on the tab On End Of Data.

The last case (constructor code being part of the On Input Row code) requires a little trick: constructor code is supposed to be executed once only, namely before the first method is called. In order to resemble this behavior, follow these steps:

1. On the tab Helper Code, define a boolean variable (e.g., constructorMissing) and initialize it to true.
2. At the beginning of the On Input Row code, insert code that looks like the following:

    if (constructorMissing) {
        // do whatever the constructor should have done
        constructorMissing = false;
    }

This ensures that this piece of code is executed only once, namely directly before the very first input row is processed.

The code pieces on the tabs On Input Row, On End Of Data, and On Receiving Transaction are embedded in other code. There is code that runs before the code entered here executes, and there is more code to follow; for example, exceptions raised within code written by a developer are caught there. As a mapping developer you cannot change this order, so you need to be aware of the following important implication.

Suppose you are writing a Java class that performs some checks on an input record and, if the checks fail, issues an error message and then skips processing to the next record. Such a piece of code might look like this:

    if (firstCheckPerformed(inputRecord) && secondCheckPerformed(inputRecord)) {
        logMessage("ERROR: one of the two checks failed!");
        return;
    }
    // else
    insertIntoTarget(inputRecord);
    countOfSucceededRows++;

This code will not compile in a JTX because it would lead to unreachable code. Why? Because the return at the end of the if statement might enable the respective method (in this case, the method named execute()) to bypass the subsequent code that is part of the framework created by the Designer.
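The behavior of this flag can be checked outside PowerCenter. The plain-Java sketch below (hypothetical names, not generated JTX code) feeds three simulated rows and confirms the guarded block runs exactly once:

```java
// Sketch of the constructorMissing trick described above.
public class LazyInitDemo {
    static boolean constructorMissing = true; // would be declared in Helper Code
    static int initRuns = 0;

    static void onInputRow(String row) {
        if (constructorMissing) {
            initRuns++;                // stands in for one-time constructor work
            constructorMissing = false;
        }
        // ... normal per-row processing would follow here ...
    }

    public static void main(String[] args) {
        for (String row : new String[] { "r1", "r2", "r3" }) {
            onInputRow(row);
        }
        System.out.println(initRuns); // prints "1"
    }
}
```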


In order to make this code work in a JTX, change it to look like this:

    if (firstCheckPerformed(inputRecord) && secondCheckPerformed(inputRecord)) {
        logMessage("ERROR: one of the two checks failed!");
    } else {
        insertIntoTarget(inputRecord);
        countOfSucceededRows++;
    }

The same principle (never use return in these code pieces) applies to all three tabs On Input Row, On End Of Data, and On Receiving Transaction.

Another important point is that the code entered on the On Input Row tab is embedded in a try-catch block, so never include any try-catch code on this tab.

How fast does a JTX perform?


A JTX communicates with PowerCenter by means of JNI (Java Native Interface). This mechanism was defined by Sun Microsystems to allow Java code to interact with dynamically linkable libraries. Though JNI is designed to perform fast, it still adds some overhead to a session due to:
- the additional process switches between the PowerCenter Integration Service and the Java Virtual Machine (JVM), which executes as another operating system process;
- Java not being compiled to machine code but to portable byte code interpreted by the JVM (although this has been largely remedied in recent years by the introduction of Just-In-Time compilers);
- the inherent complexity of the genuine object model in Java (except for most sorts of number types and characters, everything in Java is an object that occupies space and execution time).

So it is obvious that a JTX cannot perform as fast as, for example, a carefully written Custom Transformation. The rule of thumb is for a simple JTX to require approximately 50% more total running time than an EXP of comparable functionality. It can also be assumed that Java code utilizing several of the fairly complex standard classes will need even more total runtime when compared to an EXP performing the same tasks.

When should I use a JTX and when not?


As with any other standard transformation, a JTX has its advantages as well as disadvantages. The most significant disadvantages are:

- The Designer is very sensitive regarding the data types of ports that are connected to the ports of a JTX. However, most of the trouble arising from this sensitivity can be remedied rather easily by simply recompiling the Java code.
- Working with long values representing days and times within, for example, the GregorianCalendar can be extremely difficult and demanding in terms of runtime resources (memory, execution time). Date/time ports in PowerCenter are by far easier to use, so it is advisable to split up date/time ports into their individual components, such as year, month, and day, and to process these singular attributes within a JTX if needed.
- In general, a JTX can reduce performance simply by the nature of the architecture; only use a JTX when necessary.
- A JTX always has exactly one input group and one output group. For example, it is impossible to write a Joiner as a JTX.

Significant advantages to using a JTX are:


- Java knowledge and experience are generally easier to find than comparable skills in other languages.
- Prototyping with a JTX can be very fast. For example, setting up a simple JTX that calculates the calendar week and calendar year for a given date takes approximately 10 to 20 minutes, whereas writing a Custom Transformation (even for easy tasks) can take several hours.
- Not every data integration environment has access to a C compiler for compiling Custom Transformations written in C. Because PowerCenter is installed with its own JDK, this problem does not arise with a JTX.

In Summary
- If you need a transformation that adapts its processing behavior to its ports, a JTX is not the way to go. In such a case, write a Custom Transformation in C, C++, or Java to perform the necessary tasks. The CT API is considerably more complex than the JTX API, but it is also far more flexible.
- Use a JTX for development whenever a task cannot be easily completed using other standard options in PowerCenter (as long as performance requirements do not dictate otherwise).
- If performance measurements are slightly below expectations, try optimizing the Java code and the remainder of the mapping in order to increase processing speed.

Last updated: 04-Jun-08 19:14


Error Handling Process

Challenge


For an error handling strategy to be implemented successfully, it must be integral to the load process as a whole. The method of implementation for the strategy will vary depending on the data integration requirements for each project. The resulting error handling process should, however, always involve the following three steps:

1. Error identification
2. Error retrieval
3. Error correction

This Best Practice describes how each of these steps can be facilitated within the PowerCenter environment.

Description
A typical error handling process leverages the best-of-breed error management technology available in PowerCenter, such as:

- Relational database error logging
- Email notification of workflow failures
- Session error thresholds
- The reporting capabilities of PowerCenter Data Analyzer
- Data profiling

These capabilities can be integrated to facilitate error identification, retrieval, and correction as described in the flow chart below:


Error Identification
The first step in the error handling process is error identification. Error identification is often achieved through the use of the ERROR() function within mappings, enablement of relational error logging in PowerCenter, and referential integrity constraints at the database. This approach ensures that row-level issues such as database errors (e.g., referential integrity failures), transformation errors, and business rule exceptions for which the ERROR() function was called are captured in relational error logging tables. Enabling the relational error logging functionality automatically writes row-level data to a set of four error handling tables (PMERR_MSG, PMERR_DATA, PMERR_TRANS, and PMERR_SESS). These tables can be centralized in the PowerCenter repository and store information such as error messages, error data, and source row data. Row-level errors trapped in this manner include any database errors, transformation errors, and business rule exceptions for which the ERROR() function was called within the mapping.
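For example, a business rule exception can be raised from an Expression transformation by calling ERROR() inside a port expression. The port name and message below are illustrative only:

```
-- hypothetical output port expression: reject rows with a missing key
IIF(ISNULL(CUSTOMER_ID), ERROR('CUSTOMER_ID is null'), CUSTOMER_ID)
```

Rows rejected this way are written to the PMERR_* tables when relational error logging is enabled for the session.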

Error Retrieval
The second step in the error handling process is error retrieval. After errors have been captured in the PowerCenter repository, it is important to make their retrieval simple and automated so that the process is as efficient as possible. Data Analyzer can be customized to create error retrieval reports from the information stored in the PowerCenter repository. A typical error report prompts a user for the folder and workflow name, and returns a report with information such as the session, error message, and data that caused the error. In this way, the error is successfully captured in the repository and can be easily retrieved through a Data Analyzer report, or an email alert that identifies a user when a certain threshold is crossed (such as number of errors is greater than zero).

Error Correction
The final step in the error handling process is error correction. As PowerCenter automates the process of error identification, and Data Analyzer can be used to simplify error retrieval, error correction is straightforward. After retrieving an error through Data Analyzer, the error report (which contains information such as workflow name, session name, error date, error message, error data, and source row data) can be exported to various file formats including Microsoft Excel, Adobe PDF, CSV, and others. Upon retrieval of an error, the error report can be extracted into a supported format and emailed to a developer or DBA to resolve the issue, or it can be entered into a defect management tracking tool. The Data Analyzer interface supports emailing a report directly through the web-based interface to make the process even easier. For further automation, a report broadcasting rule that emails the error report to a developer's inbox can be set up to run on a pre-defined schedule. After the developer or DBA identifies the condition that caused the error, a fix for the error can be implemented. The exact method of data correction depends on various factors such as the number of records with errors, data availability requirements per SLA, the level of data criticality to the business unit(s), and the type of error that occurred. Considerations made during error correction include:

- The owner of the data should always fix the data errors. For example, if the source data comes from an external system, then the errors should be sent back to the source system to be fixed.
- In some situations, a simple re-execution of the session will reprocess the data.
- Determine whether partial data that has been loaded into the target systems needs to be backed out in order to avoid duplicate processing of rows.
- Lastly, errors can also be corrected through a manual SQL load of the data.
If the volume of errors is low, the rejected data can easily be exported from the Data Analyzer error reports to Microsoft Excel or CSV format and corrected in a spreadsheet. The corrected data can then be manually inserted into the target table using a SQL statement.

Any approach to correct erroneous data should be precisely documented and followed as a standard.


If the data errors occur frequently, reprocessing can be automated by designing a special mapping or session to correct the errors and load the corrected data into the ODS or staging area.

Data Profiling Option


For organizations that want to identify data irregularities post-load but do not want to reject such rows at load time, the PowerCenter Data Profiling option can be an important part of the error management solution. The PowerCenter Data Profiling option enables users to create data profiles through a wizard-driven GUI that provides profile reporting such as orphan record identification, business rule violation, and data irregularity identification (such as NULL or default values). The Data Profiling option comes with a license to use Data Analyzer reports that source the data profile warehouse to deliver data profiling information through an intuitive BI tool. This is a recommended best practice since error handling reports and data profile reports can be delivered to users through the same easy-to-use application.

Integrating Error Handling, Load Management, and Metadata


Error handling forms only one part of a data integration application. By necessity, it is tightly coupled to the load management process and the load metadata; it is the integration of all these approaches that ensures the system is sufficiently robust for successful operation and management. The flow chart below illustrates this in the end-to-end load process.


Error handling underpins the data integration system from end to end. Each of the load components performs validation checks, the results of which must be reported to the operational team. These components are not just PowerCenter processes such as business rule and field validation; they cover the entire data integration architecture, for example:

- Process Validation. Are all the resources in place for the processing to begin (e.g., connectivity to source systems)?
- Source File Validation. Is the source file datestamp later than the previous load?
- File Check. Does the number of rows successfully loaded match the source rows read?

Last updated: 09-Feb-07 13:42


Error Handling Strategies - Data Warehousing

Challenge


A key requirement for any successful data warehouse or data integration project is that it attains credibility within the user community. At the same time, it is imperative that the warehouse be as up-to-date as possible, since the more recent the information derived from it is, the more relevant it is to the business operations of the organization, thereby providing the best opportunity to gain an advantage over the competition.

Transactional systems can manage to function even with a certain amount of error, since an individual transaction in error has a limited effect on the business figures as a whole, and corrections can be applied to erroneous data after the event (i.e., after the error has been identified). In data warehouse systems, however, any systematic error (e.g., for a particular load instance) not only affects a larger number of data items, but may potentially distort key reporting metrics. Such data cannot be left in the warehouse "until someone notices" because business decisions may be driven by such information. Therefore, it is important to proactively manage errors, identifying them before, or as, they occur. If errors occur, it is equally important either to prevent them from getting to the warehouse at all, or to remove them from the warehouse immediately (i.e., before the business tries to use the information in error).

The types of error to consider include:

- Source data structures
- Sources presented out of sequence
- Old sources re-presented in error
- Incomplete source files
- Data-type errors for individual fields
- Unrealistic values (e.g., impossible dates)
- Business rule breaches
- Missing mandatory data
- O/S errors
- RDBMS errors

These cover both high-level (i.e., related to the process or a load as a whole) and low-level (i.e., field or column-related errors) concerns.

Description
In an ideal world, when an analysis is complete, you have a precise definition of source and target data; you can be sure that every source element is populated correctly, with meaningful values, never missing a value, and fulfilling all relational constraints. At the same time, source data sets always have a fixed structure, are always available on time (and in the correct order), and are never corrupted during transfer to the data warehouse. In addition, the OS and RDBMS never run out of resources or have permissions and privileges change.

Realistically, however, the operational applications are rarely able to cope with every possible business scenario or combination of events; operational systems crash, networks fall over, and users may not use the transactional systems in quite the way they were designed. The operational systems also typically need some flexibility to allow non-fixed data to be stored (typically as free-text comments). In every case, there is a risk that the source data does not match what the data warehouse expects.

Because of the credibility issue, in-error data must not be propagated to the metrics and measures used by the business managers. If erroneous data does reach the warehouse, it must be identified and removed immediately (before the current version of the warehouse is published). Preferably, error data should be identified during the load process and prevented from reaching the warehouse at all. Ideally, erroneous source data should be identified before a load even begins, so that no resources are wasted trying to load it.

As a principle, data errors should be corrected at the source. As soon as any attempt is made to correct errors within the warehouse, there is a risk that the lineage and provenance of the data will be lost. From that point on, it becomes impossible to guarantee that a metric or data item came from a specific source via a specific chain of processes. As a by-product, adopting this principle also helps to tie both the end-users and those responsible for the source data into the warehouse process; source data staff understand that their professionalism directly affects the quality of the reports, and end-users become owners of their data.

As a final consideration, error management (the implementation of an error handling strategy) complements and overlaps load management, data quality and key management, and operational processes and procedures. Load management processes record at a high level if a load is unsuccessful; error management records the details of why the failure occurred. Quality management defines the criteria whereby data can be identified as in error; and error management