Best Practices: Table of Contents

Best Practices
Configuration Management
Migration Procedures
Development Techniques
Development FAQs
Data Cleansing
Data Connectivity Using PowerConnect for BW Integration Server
Data Connectivity using PowerConnect for Mainframe
Data Connectivity using PowerConnect for MQSeries
Data Connectivity using PowerConnect for PeopleSoft
Data Connectivity using PowerConnect for SAP
Incremental Loads
Mapping Design
Metadata Reporting and Sharing
Naming Conventions
Session and Data Partitioning
Using Parameters, Variables and Parameter Files
Error Handling
A Mapping Approach to Trapping Data Errors
Design Error Handling Infrastructure
Documenting Mappings Using Repository Reports
Error Handling Strategies
Using Shortcut Keys in PowerCenter Designer
Object Management
Creating Inventories of Reusable Objects & Mappings
Operations


INFORMATICA CONFIDENTIAL

BEST PRACTICES


Updating Repository Statistics
Daily Operations
Load Validation
Third Party Scheduler
Event Based Scheduling
Repository Administration
High Availability
Performance Tuning
Recommended Performance Tuning Procedures
Performance Tuning Databases
Performance Tuning UNIX Systems
Performance Tuning Windows NT/2000 Systems
Tuning Mappings for Better Performance
Tuning Sessions for Better Performance
Determining Bottlenecks
Platform Configuration
Advanced Client Configuration Options
Advanced Server Configuration Options
Platform Sizing
Recovery
Running Sessions in Recovery Mode
Project Management
Developing the Business Case
Assessing the Business Case
Defining and Prioritizing Requirements
Developing a WBS
Developing and Maintaining the Project Plan
Managing the Project Lifecycle
Security
Configuring Security



Migration Procedures

Challenge

To develop a migration strategy that ensures clean migration between development, test, QA, and production, thereby protecting the integrity of each of these environments as the system evolves.

Description

In every application deployment, a migration strategy must be formulated to ensure a clean migration between development, test, quality assurance, and production. The migration strategy is largely influenced by the technologies that are deployed to support the development and production environments. These technologies include the databases, the operating systems, and the available hardware.

Informatica offers flexible migration techniques that can be adapted to fit the existing technology and architecture of various sites, rather than proposing a single fixed migration strategy. The means to migrate work from development to production depends largely on the repository environment, which is either:

• Standalone PowerCenter, or
• Distributed PowerCenter

This Best Practice describes several migration strategies, outlining the advantages and disadvantages of each. It also discusses an XML method provided in PowerCenter 5.1 to support migration in either a Standalone or a Distributed environment.

Standalone PowerMart/PowerCenter

In a standalone environment, all work is performed in a single Informatica repository that serves as the shared metadata store. In this standalone environment, segregating the workspaces ensures that the migration from development to production is seamless. Workspace segregation can be achieved by creating separate folders for each work area. For instance, we might build a single data mart for the finance division within a corporation. In this example, we would create a minimum of four folders to manage our metadata. The folders might look something like the following:

• FINANCE_DEV
• FINANCE_TEST
• FINANCE_QA
• FINANCE_PROD

In this scenario, mappings are developed in the FINANCE_DEV folder. As development is completed on particular mappings, they will be copied one at a time to the FINANCE_TEST folder. New sessions will be created or copied for each mapping in the FINANCE_TEST folder. When unit testing has been completed successfully, the mappings are copied into the FINANCE_QA folder. This process continues until the mappings are integrated into the production schedule. At that point, new sessions will be created in the FINANCE_PROD folder, with the database connections adjusted to point to the production environment.

Introducing shortcuts in a single standalone environment complicates the migration process, but offers an efficient method for centrally managing sources and targets. A common folder can be used for sharing reusable objects such as shared sources, target definitions, and reusable transformations. If a common folder is used, there should be one common folder for each environment (i.e., SHARED_DEV, SHARED_TEST, SHARED_QA, SHARED_PROD).

Migration Example Process

Copying the mappings into the next stage enables the user to promote the desired mapping to test, QA, or production at the lowest level of granularity. If the folder where the mapping is to be copied does not contain the referenced source/target tables or transformations, then these objects will automatically be copied along with the mapping. The advantage of this promotion strategy is that individual mappings can be promoted as soon as they are ready for production. However, because only one mapping at a time can be copied, promoting a large number of mappings into production would be very time consuming. Additional time is required to re-create or copy all sessions from scratch, especially if pre- or post-session scripts are used.

On the initial move to production, if all mappings are completed, the entire FINANCE_QA folder could be copied and renamed to FINANCE_PROD. With this approach, it is not necessary to promote all mappings and sessions individually. After the initial migration, however, mappings will be promoted on a "case-by-case" basis.


Follow these steps to copy a mapping from Development to Test:

1. If using shortcuts, first follow these substeps; if not using shortcuts, skip to step 2:
• Create four common folders, one for each migration stage (COMMON_DEV, COMMON_TEST, COMMON_QA, COMMON_PROD).
• Copy the shortcut objects into the COMMON_TEST folder.

2. Copy the mapping from Development into Test.
• In the PowerCenter Designer, open the appropriate test folder, and drag and drop the mapping from the development folder into the test folder.

3. If using shortcuts, follow these substeps; if not using shortcuts, skip to step 4:
• Open the mapping that uses shortcuts.
• Using the newly copied mapping, open it in the Designer and bring in the newly copied shortcut.
• Using the old shortcut as a model, link all of the input ports to the new shortcut.
• Using the old shortcut as a model, link all of the output ports to the new shortcut. However, if any of the objects are active, first delete the old shortcut before linking the output ports.

4. Create or copy a session in the Server Manager to run the mapping (make sure the mapping exists in the current repository first).
• If copying the mapping, follow the copy session wizard.
• If creating the mapping, enter all the appropriate information in the Session Wizard.

5. Implement appropriate security, such as:
• In Development, the owner of the folders should be a user in the development group.
• In Test and Quality Assurance, change the owner of the Test/QA folders to a user in the Test/QA group.
• In Production, change the owner of the folders to a user in the Production group.
• Revoke all rights to Public other than Read for the Production folders.

Performance Implications in the Single Environment

A disadvantage of the single environment approach is that even though the Development, Test, QA, and Production "environments" are stored in separate folders, they all reside on the same server. This can have negative performance implications. If Development or Test loads are running simultaneously with Production loads, the server machine may reach 100 percent utilization and Production performance will suffer. Often, Production loads run late at night, and most Development and Test loads run during the day, so this does not pose a problem. However, situations do arise where performance benchmarking with large volumes or other unusual circumstances can cause test loads to run overnight, contending with the pre-scheduled Production runs.

Distributed PowerCenter

In a distributed environment, there are separate, independent environments (i.e., hardware and software) for Development, Test, QA, and Production. This is the preferred method for handling Development to Production migrations. Because each environment is segregated from the others, work performed in Development cannot impact Test, QA, or Production.

With a fully distributed approach, separate repositories provide the same function as the separate folders in the standalone environment described previously. Each repository has a similar name for the folders in the standalone environment. For instance, in our Finance example we would have four repositories: FINANCE_DEV, FINANCE_TEST, FINANCE_QA, and FINANCE_PROD. The mappings are created in the Development repository, moved into the Test repository, and then eventually into the Production environment.

There are three main techniques to migrate from Development to Production, each involving some advantages and disadvantages:

• Repository Copy
• Folder Copy
• Object Copy

Repository Copy

The main advantage to this approach is the ability to copy everything at once from one environment to another, including source and target tables, transformations, mappings, and sessions. Another advantage is the ability to automate this process without having users perform this process. The final advantage is that everything can be moved without breaking/corrupting any of the objects.

There are, however, three distinct disadvantages to the repository copy method. The first is that everything is moved at once (also an advantage). The trouble with this is that everything is moved, ready or not. For example, there may be 50 mappings in QA but only 40 of them are production-ready. The 10 unready mappings are moved into production along with the 40 production-ready maps, which leads to the second disadvantage: maintenance is required to remove any unwanted or excess objects. Another disadvantage is the need to adjust server variables, sequences, parameters/variables, database connections, etc. Everything will need to be set up correctly on the new server that will now host the repository.

There are three ways to accomplish the Repository Copy method:

• Copying the Repository
• Repository Backup and Restore
• PMREP

Copying the Repository

The repository copy command is probably the easiest method of migration. To perform this, go to the File menu of the Repository Manager and select Copy Repository. From there the user is prompted to choose the location to which the repository will be copied. The following screen shot shows the dialog box used to input the new location information:

To successfully perform the copy, the user must delete the current repository in the new location. For example, if a user was copying a repository from DEV to TEST, then the TEST repository must first be deleted using the Delete option in the Repository Manager to create room for the new repository. Then the Copy Repository routine must be run.

Repository Backup and Restore

The Backup and Restore Repository is another simple method of copying an entire repository. To perform this function, go to the File menu in the Repository Manager and select Backup Repository. This will create a .REP file containing all repository information. To restore the repository, simply open the Repository Manager on the destination server and select Restore Repository from the File menu. Select the created .REP file to automatically restore the repository in the destination server. To ensure success, be sure to first delete any matching destination repositories, since the Restore Repository option does not delete the current repository.

PMREP

Using the PMREP commands is essentially the same as the Backup and Restore Repository method except that it is run from the command line. The PMREP utilities can be utilized both from the Informatica Server and from any client machines connected to the server.

The following table documents the available PMREP commands:

The following is a sample of the command syntax used within a batch file to connect to and backup a repository. Using the code example below as a model, scripts can be written to be run on a daily basis to perform functions such as connect, backup, restore, etc:

After following one of the above procedures to migrate into Production, follow these steps to convert the repository to Production:

1. Disable sessions that schedule mappings that are not ready for Production or simply delete the mappings and sessions.
• Disable the sessions in the Server Manager by opening the session properties, and then clearing the Enable checkbox under the General tab.
• Delete the sessions in the Server Manager and the mappings in the Designer.

2. Modify the database connection strings to point to the Production sources and targets.
• In the Server Manager, select Database Connections from the Server Configuration menu.
• Edit each database connection by changing the connect string to point to the production sources and targets.
• If using lookup transformations in the mappings and the connect string is anything other than $SOURCE or $TARGET, then the connect string will need to be modified appropriately.

3. Modify the pre- and post-session commands as necessary.
• In the Server Manager, open the session properties, and from the General tab make the required changes to the pre- and post-session scripts.

4. Implement appropriate security, such as:
• In Development, ensure that the owner of the folders is a user in the Development group.
• In Test and Quality Assurance, change the owner of the Test/QA folders to a user in the Test/QA group.
• In Production, change the owner of the folders to a user in the Production group.
• Revoke all rights to Public other than Read for the Production folders.

Folder Copy

Copying an entire folder allows you to quickly promote all of the objects in the Development folder to Test, and so forth. All source and target tables, reusable transformations, mappings, and sessions are promoted at once. Therefore, everything in the folder must be ready to migrate forward. If certain mappings are not ready, then after the folder is copied, developers (or the Repository Administrator) must manually delete these mappings from the new folder.

The advantages of Folder Copy are:

• Easy to move the entire folder and all objects in it
• Detailed Wizard guides the user through the entire process
• There's no need to update or alter any Database Connections, sequences or server variables.

The disadvantages of Folder Copy are:

• User needs to be logged into multiple environments simultaneously.
• The repository is locked while Folder Copy is being performed.

If copying a folder, for example, from QA to Production, follow these steps:

1. If using shortcuts, follow these substeps, otherwise skip to step 2:

• In each of the dedicated repositories, create a common folder using exactly the same name and case as in the "source" repository.
• Copy the shortcut objects into the common folder in Production and make sure the shortcut has exactly the same name.
• Open and connect to either the Repository Manager or Designer.

2. Drag and drop the folder onto the production repository icon within the Navigator tree structure. (To copy the entire folder, drag and drop the folder icon just under the repository level.)

3. Follow the Copy Folder Wizard steps. If a folder with that name already exists, it must be renamed.

4. Point the folder to the correct shared folder if one is being used:

After performing the Folder Copy method, be sure to remember the following steps:

1. Modify the pre- and post-session commands as necessary:
• In the Server Manager, open the session properties, and from the General tab make the required changes to the pre- and post-session scripts.

2. Implement appropriate security:
• In Development, ensure the owner of the folders is a user in the Development group.
• In Test and Quality Assurance, change the owner of the Test/QA folders to a user in the Test/QA group.
• In Production, change the owner of the folders to a user in the Production group.
• Revoke all rights to Public other than Read for the Production folders.

Object Copy

Copying mappings into the next stage within a networked environment has many of the same advantages and disadvantages as in the standalone environment, but the process of handling shortcuts is simplified in the networked environment. For additional information, see the previous description of Object Copy for the standalone environment. Additional advantages and disadvantages of Object Copy in a distributed environment include:

Advantages:

• More granular control over objects

Disadvantages:

• Much more work to deploy an entire group of objects
• Shortcuts must exist prior to importing/copying mappings

1. If using shortcuts, follow these substeps, otherwise skip to step 2:
• In each of the dedicated repositories, create a common folder with the exact same name and case.
• Copy the shortcuts into the common folder in Production, making sure the shortcut has the exact same name.

2. Copy the mapping from quality assurance (QA) into production.
• In the Designer, connect to both the QA and Production repositories and open the appropriate folders in each.
• Drag and drop the mapping from QA into Production.

3. Create or copy a session in the Server Manager to run the mapping (make sure the mapping exists in the current repository first).
• If copying the mapping, follow the copy session wizard.
• If creating the mapping, enter all the appropriate information in the Session Wizard.

4. Implement appropriate security.
• In Development, ensure the owner of the folders is a user in the Development group.
• In Test and Quality Assurance, change the owner of the Test/QA folders to a user in the Test/QA group.
• In Production, change the owner of the folders to a user in the Production group.
• Revoke all rights to Public other than Read for the Production folders.

Recommendations

Informatica recommends using the following process when running in a three-tiered environment with Development, Test/QA, and Production servers:

For migrating from Development into Test, Informatica recommends using the Object Copy method. This method gives you total granular control over the objects that are being moved. It ensures that the latest development maps can be moved over manually as they are completed. For recommendations on performing this copy procedure correctly, see the steps outlined in the Object Copy section.

When migrating from Test to Production, Informatica recommends using the Repository Copy method. Before performing this migration, all code in the Test server should be frozen and tested. After the Test code is cleared for production, use one of the repository copy methods. (Refer to the steps outlined in the Repository Copy section for recommendations to ensure that this process is successful.) If similar server and database naming conventions are utilized, there will be minimal or no changes required to sessions that are created or copied to the production server.

XML Object Copy Process

Another method of copying objects in a distributed (or centralized) environment is to copy objects by utilizing PowerMart/PowerCenter's XML functionality. This method is more useful in the distributed environment because it allows for backup into an XML file to be moved across the network.

The XML Object Copy Process works in a manner very similar to the Repository Copy backup and restore method, as it allows you to copy sources, targets, reusable transformations, mappings, and sessions. Once the XML file has been created, that XML file can be changed with a text editor to allow more flexibility. For example, if you had to copy one session many times, you would export that session to an XML file. Then, you could edit that file to find everything within the <Session> tag, copy that text, and paste that text within the XML file. You would then change the name of the session you just pasted to be unique. When you import that XML file back into your folder, two sessions will be created.

The following demonstrates the import/export functionality:

1. Objects are exported into an XML file:

2. Objects are imported into a repository from the corresponding XML file:

3. Sessions can be exported and imported into the Server Manager in the same way (the corresponding mappings must exist for this to work).
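The text-editor technique described for the XML Object Copy Process (find the <Session> element, paste a copy, and give the copy a unique name) can also be scripted. The following Python sketch is only an illustration under stated assumptions: the sample XML, the NAME attribute, and the session names are hypothetical stand-ins, and a real PowerCenter export file has a much richer layout that should be inspected before adapting this.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical stand-in for an exported file; the real export layout differs.
EXPORTED = """<Folder>
  <Session NAME="s_load_sales">
    <Attribute NAME="Mapping" VALUE="m_load_sales"/>
  </Session>
</Folder>"""


def duplicate_session(xml_text, old_name, new_name, tag="Session", name_attr="NAME"):
    """Return xml_text with a renamed copy of the named session placed after the original."""
    root = ET.fromstring(xml_text)
    for parent in root.iter():
        for index, child in enumerate(list(parent)):
            if child.tag == tag and child.get(name_attr) == old_name:
                clone = copy.deepcopy(child)      # copy the whole <Session> subtree
                clone.set(name_attr, new_name)    # the pasted session needs a unique name
                parent.insert(index + 1, clone)
                return ET.tostring(root, encoding="unicode")
    raise ValueError(f"no <{tag}> named {old_name!r} found")


result = duplicate_session(EXPORTED, "s_load_sales", "s_load_sales_copy")
```

Importing the resulting file would then be expected to create two sessions, as described above. The tag and attribute names are parameters so the sketch can be adjusted to whatever the exported file actually contains.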


Development FAQs

Challenge

Using the PowerCenter product suite most effectively to develop, name, and document components of the analytic solution. While the most effective use of PowerCenter depends on the specific situation, this Best Practice addresses some questions that are commonly raised by project teams. It provides answers in a number of areas, including Scheduling, Backup Strategies, Server Administration, and Metadata. Refer to the product guides supplied with PowerCenter for additional information.

Description

The following pages summarize some of the questions that typically arise during development and suggest potential resolutions.

Q: How does source format affect performance? (i.e., is it more efficient to source from a flat file rather than a database?)

In general, a flat file that is located on the server machine loads faster than a database located on the server machine. Fixed-width files are faster than delimited files because delimited files require extra parsing. However, if there is an intent to perform intricate transformations before loading to target, it may be advisable to first load the flat file into a relational database, which allows the PowerCenter mappings to access the data in an optimized fashion by using filters and custom SQL SELECTs where appropriate.

Q: What are some considerations when designing the mapping? (i.e., what is the impact of having multiple targets populated by a single map?)

With PowerCenter, it is possible to design a mapping with multiple targets. You can then load the targets in a specific order using Target Load Ordering. The recommendation is to limit the amount of complex logic in a mapping. Not only is it easier to debug a mapping with a limited number of objects, but such mappings can also be run concurrently and make use of more system resources. When using multiple output files (targets), consider writing to multiple disks or file systems simultaneously. This minimizes disk seeks and applies to a session writing to multiple targets, and to multiple sessions running simultaneously.

Q: What are some considerations for determining how many objects and transformations to include in a single mapping?

There are several items to consider when building a mapping. The business requirement is always the first consideration, regardless of the number of objects it takes to fulfill the requirement. The most expensive use of the DTM is passing unnecessary data through the mapping. It is best to use filters as early as possible in the mapping to remove rows of data that are not needed. This is the SQL equivalent of the WHERE clause. Using the filter condition in the Source Qualifier to filter out the rows at the database level is a good way to increase the performance of the mapping.

Log File Organization

Q: Where is the best place to maintain Session Logs?

One often-recommended location is the default /SessLogs/ folder in the Informatica directory, keeping all log files in the same directory.

Q: What documentation is available for the error codes that appear within the error log files?

Log file errors and descriptions appear in Appendix C of the PowerCenter User Guide. Error information also appears in the PowerCenter Help File within the PowerCenter client applications. For other database-specific errors, consult your Database User Guide.

Scheduling Techniques

Q: What are the benefits of using batches rather than sessions?

Using a batch to group logical sessions minimizes the number of objects that must be managed to successfully load the warehouse. For example, a hundred individual sessions can be logically grouped into twenty batches. The Operations group can then work with twenty batches to load the warehouse, which simplifies the operations tasks associated with loading the targets. There are two types of batches: sequential and concurrent.

o A sequential batch simply runs sessions one at a time, in a linear sequence. Sequential batches help ensure that dependencies are met as needed. For example, a sequential batch ensures that session1 runs before session2 when session2 is dependent on the load of session1, and so on. It's also possible to set up conditions to run the next session only if the previous session was successful, or to stop on errors, etc.

o A concurrent batch groups logical sessions together, like a sequential batch, but runs all the sessions at one time. This can reduce the load times into the warehouse, taking advantage of hardware platforms' Symmetric Multi-Processing (SMP) architecture. A new batch is sequential by default; to make it concurrent, explicitly select the Concurrent check box.
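The difference between the two batch types can be illustrated outside the product. The following Python sketch is not PowerCenter code; it models each session as a hypothetical callable that returns True on success, and shows a sequential run that stops at the first failure next to a concurrent run that starts everything at once.

```python
from concurrent.futures import ThreadPoolExecutor


def run_sequential(sessions, stop_on_error=True):
    """Run sessions one at a time, in order; optionally stop at the first failure."""
    results = {}
    for name, run in sessions:
        results[name] = run()
        if stop_on_error and not results[name]:
            break  # dependent sessions after a failure are not started
    return results


def run_concurrent(sessions):
    """Start every session at once and wait for all of them to finish."""
    with ThreadPoolExecutor(max_workers=max(1, len(sessions))) as pool:
        futures = {name: pool.submit(run) for name, run in sessions}
        return {name: future.result() for name, future in futures.items()}


# Hypothetical sessions: session2 fails, the others succeed.
sessions = [
    ("session1", lambda: True),
    ("session2", lambda: False),
    ("session3", lambda: True),
]
sequential_results = run_sequential(sessions)
concurrent_results = run_concurrent(sessions)
```

In the sequential run, session3 never starts because session2 failed; in the concurrent run, all three sessions execute regardless of each other's outcome, which is why a concurrent batch suits sessions that have no dependencies on one another.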

Other batch options, such as nesting batches within batches, can further reduce the complexity of loading the warehouse. This capability also allows for the creation of very complex and flexible batch streams without the use of a third-party scheduler.

Q: Assuming a batch failure, does PowerCenter allow restart from the point of failure?

Yes. When a session or sessions in a batch fail, you can perform recovery to complete the batch. The steps to take vary depending on the type of batch:

If the batch is sequential, you can recover data from the session that failed and run the remaining sessions in the batch. If a session within a concurrent batch fails, but the rest of the sessions complete successfully, you can recover data from the failed session targets to complete the batch. However, if all sessions in a concurrent batch fail, you might want to truncate all targets and run the batch again.

Q: What guidelines exist regarding the execution of multiple concurrent sessions / batches within or across applications?

Session/Batch Execution needs to be planned around two main constraints:

• Available system resources
• Memory and processors

The number of sessions that can run at one time depends on the number of processors available on the server. The load manager is always running as a process. As a general rule, a session will be compute-bound, meaning its throughput is limited by the availability of CPU cycles. Most sessions are transformation intensive, so the DTM always runs. Also, some sessions require more I/O, so they use less processor time. Generally, a session needs about 120 percent of a processor for the DTM, reader, and writer in total.

For concurrent sessions:

• One session per processor is about right; you can run more, but all sessions will slow slightly.
• Remember that other processes may also run on the PowerCenter server machine; overloading a production machine will slow overall performance.
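The 120 percent rule of thumb can be turned into a quick estimate. The Python sketch below is a rough planning aid only; the function name, the reserved-capacity parameter, and the processor counts are hypothetical, and the result should be treated as a starting point rather than a limit enforced by the server.

```python
def estimate_concurrent_sessions(processors, per_session_percent=120, reserved_percent=0):
    """Estimate how many sessions fit without oversubscribing the processors.

    per_session_percent: CPU a session needs for DTM, reader, and writer combined.
    reserved_percent: CPU set aside for other processes on the same machine.
    """
    available_percent = processors * 100 - reserved_percent
    return max(0, available_percent // per_session_percent)


# Hypothetical servers with 4 and 8 processors.
four_way = estimate_concurrent_sessions(4)
eight_way = estimate_concurrent_sessions(8)
eight_way_shared = estimate_concurrent_sessions(8, reserved_percent=100)
```

Under these assumptions a 4-processor server fits about 3 sessions and an 8-processor server about 6, or 5 if one processor's worth of capacity is reserved for other work. Running one session per processor, as suggested above, sits slightly over this estimate, which is consistent with the note that all sessions then slow slightly.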

Even after available processors are determined, it is necessary to look at overall system resource usage. Determining memory usage is more difficult than the processors calculation; it tends to vary according to system load and number of Informatica sessions running. The first step is to estimate memory usage, accounting for:

• Operating system kernel and miscellaneous processes
• Database engine
• Informatica Load Manager

Each session creates three processes: the Reader, Writer, and DTM.

• If multiple sessions run concurrently, each has three processes.
• More memory is allocated for lookups, aggregates, ranks, and heterogeneous joins in addition to the shared memory segment.

At this point, you should have a good idea of what is left for concurrent sessions. It is important to arrange the production run to maximize use of this memory. Remember to account for sessions with large memory requirements; you may be able to run only one large session, or several small sessions concurrently.

Load Order Dependencies are also an important consideration because they often create additional constraints. For example, load the dimensions first, then facts. Also, some sources may only be available at specific times, some network links may become saturated if overloaded, and some target tables may need to be available to end users earlier than others.

Q: Is it possible to perform two "levels" of event notification? One at the application level, and another at the PowerCenter server level to notify the Server Administrator?

The application level of event notification can be accomplished through post-session e-mail. Post-session e-mail allows you to create two different messages, one to be sent upon successful completion of the session, the other to be sent if the session fails. Messages can be a simple notification of session completion or failure, or a more complex notification containing specifics about the session. You can use the following variables in the text of your post-session e-mail:

E-mail Variable   Description
%s                Session name
%l                Total records loaded
%r                Total records rejected
%e                Session status
%t                Table details, including read throughput in bytes/second and write throughput in rows/second
%b                Session start time
%c                Session completion time
%i                Session elapsed time (session completion time-session start time)
%g                Attaches the session log to the message
%a<filename>      Attaches the named file. The file must be local to the Informatica Server. The following are valid filenames: %a<c:\data\sales.txt> or %a</users/john/data/sales.txt>. On Windows NT, you can attach a file of any type. On UNIX, you can only attach text files. If you attach a non-text file, the send might fail. Note: The filename cannot include the Greater Than character (>) or a line break.
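To make the substitution behaviour concrete, the Python sketch below mimics how the single-letter variables might expand inside a message body. It is an illustration only, not the server's implementation: the function name and template are hypothetical, the attachment variable %a<filename> is deliberately left out, and the sample values are borrowed from the sample message shown later in this section.

```python
import re


def expand_email_text(template, values):
    """Replace post-session e-mail variables such as %s or %l with supplied values."""
    def substitute(match):
        token = match.group(0)
        # Variables without a supplied value are left untouched.
        return str(values.get(token, token))

    return re.sub(r"%[slretbcig]", substitute, template)


# Hypothetical message text and values.
template = "Session %s finished: %l rows loaded, %r rows rejected."
values = {"%s": "sInstrTest", "%l": 1, "%r": 0}
message = expand_email_text(template, values)
```

With these inputs the message reads "Session sInstrTest finished: 1 rows loaded, 0 rows rejected." A success message and a failure message would each use their own template, matching the two-message arrangement described above.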

The PowerCenter Server on UNIX uses rmail to send post-session e-mail. The repository user who starts the PowerCenter server must have the rmail tool installed in the path in order to send e-mail. To verify the rmail tool is accessible:

1. Log in to the UNIX system as the PowerCenter user who starts the PowerCenter Server.
2. Type rmail <fully qualified email address> at the prompt and press Enter.
3. Type . to indicate the end of the message and press Enter.
4. You should receive a blank e-mail from the PowerCenter user's e-mail account. If not, locate the directory where rmail resides and add that directory to the path.
5. When you have verified that rmail is installed correctly, you are ready to send post-session e-mail.

The output should look like the following:

    Session complete.
    Session name: sInstrTest
    Total Rows Loaded = 1
    Total Rows Rejected = 0

    Completed
    Rows Loaded   Rows Rejected   Read Throughput (bytes/sec)   Write Throughput (rows/sec)   Table Name   Status
    1             0               30                            1                             t_Q3_sales

    No errors encountered.
    Start Time: Tue Sep 14 12:26:31 1999
    Completion Time: Tue Sep 14 12:26:41 1999
    Elapsed time: 0:00:10 (h:m:s)

This information, or a subset, can also be sent to any text pager that accepts e-mail.

Backup Strategy Recommendation

Q: Can individual objects within a repository be restored from the back-up or from a prior version?

At the present time, individual objects cannot be restored from a back-up using the PowerCenter Server Manager (i.e., you can only restore the entire repository). However, it is possible to restore the back-up repository into a different database and then manually copy the individual objects back into the main repository. Refer to Migration Procedures for details on promoting new or changed objects between development, test, QA, and production environments.

Server Administration

Q: What built-in functions does PowerCenter provide to notify someone in the event that the server goes down, or some other significant event occurs?

There are no built-in functions in the server to send notification if the server goes down. However, it is possible to implement a shell script that will sense whether the server is running or not. For example, the command "pmcmd pingserver" will give a return code or status which will tell you if the server is up and running. Using the results of this command as a basis, a complex notification script could be built.

Q: What system resources should be monitored? What should be considered normal or acceptable server performance levels?

The pmprocs utility, which is available for UNIX systems only, shows the currently executing PowerCenter processes. Pmprocs is a script that combines the ps and ipcs commands. It is available through Informatica Technical Support. The utility provides the following information:

- CPID - Creator PID (process ID)


- LPID - Last PID that accessed the resource
- Semaphores - used to sync the reader and writer
- 0 or 1 - shows slot in LM shared memory

(See Chapter 16 in the PowerCenter Administrator's Guide for additional details.)

Q: What cleanup (if any) should be performed after a UNIX server crash? Or after an Oracle instance crash?

If the UNIX server crashes, you should first check to see if the Repository Database is able to come back up successfully. If this is the case, then you should try to start the PowerCenter server. Use the pmserver.err log to check if the server has started correctly. You can also use ps -ef | grep pmserver to see if the server process (the Load Manager) is running.

Metadata

Q: What recommendations or considerations exist as to naming standards or repository administration for metadata that might be extracted from the PowerCenter repository and used in others?

With PowerCenter, you can enter description information for all repository objects, sources, targets, transformations, etc., but the amount of metadata that you enter should be determined by the business requirements. You can also drill down to the column level and give descriptions of the columns in a table if necessary. All information about column size and scale, datatypes, and primary keys are stored in the repository.

The decision on how much metadata to create is often driven by project timelines. While it may be beneficial for a developer to enter detailed descriptions of each column, expression, variable, etc., it is also very time consuming to do so. Therefore, this decision should be made on the basis of how much metadata will be required by the systems that use the metadata.

Q: What procedures exist for extracting metadata from the repository?

Informatica offers an extremely rich suite of metadata-driven tools for data warehousing applications. All of these tools store, retrieve, and manage their metadata in Informatica's central repository. The motivation behind the original Metadata Exchange (MX) architecture was to provide an effective and easy-to-use interface to the repository. Today, Informatica and several key Business Intelligence (BI) vendors, including Brio, Business Objects, Cognos, and MicroStrategy, are effectively using the MX views to report and query the Informatica metadata.

Informatica does not recommend accessing the repository directly, even for SELECT access. Rather, views have been created to provide access to the metadata stored in the repository.
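The monitoring script suggested earlier under Server Administration could be written in any scripting language. The Python sketch below runs a caller-supplied command (for example, the "pmcmd pingserver" command with whatever connection arguments your installation requires) and treats a zero exit status as "server is up". The zero-means-up convention, the function names, and the notify callback are assumptions made for illustration; consult the pmcmd documentation for the actual return codes.

```python
import subprocess
import sys

def server_is_up(ping_command):
    """Run the ping command; assume exit status 0 means the server responded."""
    try:
        result = subprocess.run(ping_command, capture_output=True, timeout=60)
    except (OSError, subprocess.TimeoutExpired):
        # Command missing or hung: treat as "server not reachable".
        return False
    return result.returncode == 0

def check_and_notify(ping_command, notify):
    """Call notify(message) when the server does not answer; return the status."""
    up = server_is_up(ping_command)
    if not up:
        notify("PowerCenter server did not respond to: " + " ".join(ping_command))
    return up

if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; replace it with the real
    # pmcmd invocation for your environment.
    demo_command = [sys.executable, "-c", "raise SystemExit(0)"]
    check_and_notify(demo_command, print)
```

In practice the notify callback would send e-mail or page an operator, and the check would be scheduled (for example from cron) at whatever interval the operations team considers acceptable.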

Data Cleansing

Challenge

Accuracy is one of the biggest obstacles blocking the success of many data warehousing projects. If users discover data inconsistencies, the user community may lose faith in the entire warehouse’s data. However, it is not unusual to discover that as many as half the records in a database contain some type of information that is incomplete, inconsistent, or incorrect. The challenge is therefore to cleanse data online, at the point of entry into the data warehouse or operational data store (ODS), to ensure that the warehouse provides consistent and accurate data for business decision making.

Description

Informatica has several partners in the data cleansing arena. The partners and respective tools include the following:

DataMentors - Provides tools that are run before the data extraction and load process to clean source data. Available tools are:

• DMDataFuse™ - a data cleansing and householding system with the power to accurately standardize and match data.
• DMValiData™ - an effective data analysis system that profiles and identifies inconsistencies between data and metadata.
• DMUtils - a powerful non-compiled scripting language that operates on flat ASCII or delimited files. It is primarily used as a query and reporting tool. It also provides a way to reformat and summarize files.

FirstLogic – FirstLogic offers direct interfaces to PowerCenter during the extract and load process as well as providing pre-data extraction data cleansing tools like DataRight and Merge/Purge. The online interface (ACE Library) integrates the TrueName Library and Merge/Purge Library of FirstLogic, as Transformation Components, using the Informatica External Procedures protocol. Thus, these components can be invoked for parsing, standardization, cleansing, enhancement, and matching of the name and address information during the PowerCenter ETL stage of building a data mart or data warehouse.

Paladyne – The flagship product, Datagration, is an open, flexible data quality system that can repair any type of data (in addition to its name and address) by incorporating custom business rules and logic. Datagration's Data Discovery Message Gateway feature assesses data cleansing requirements using automated data discovery tools that identify data patterns. Data Discovery enables Datagration to search through a field of free form data and re-arrange the tokens (i.e., words, data elements) into a logical order. Datagration supports relational database systems and flat files as data sources and any application that runs in batch mode.

Vality – Provides a product called Integrity, which identifies business relationships (such as households) and duplications, reveals undocumented business practices, and discovers metadata/field content discrepancies. It offers data analysis and investigation, conditioning, and unique probabilistic and fuzzy matching capabilities. Vality is in the process of developing a "TX Integration" to PowerCenter. Delivery of this bridge was originally scheduled for May 2001, but no further information is available at this time.

Trillium – Trillium’s eQuality customer information components (a web enabled tool) are integrated with Informatica’s Transformation Exchange modules and reside on the same server as Informatica’s transformation engine. As a result, Informatica users can invoke Trillium’s four data quality components through an easy-to-use graphical desktop object. The four components are:

• Converter: data analysis and investigation module for discovering word patterns and phrases within free form text
• Parser: processing engine for data cleansing, elementizing and standardizing customer data
• Geocoder: an internationally-certified postal and census module for address verification and standardization
• Matcher: a module designed for relationship matching and record linking

Integration Examples

The following sections describe how to integrate two of the tools with PowerCenter.

FirstLogic – ACE

The following graphic illustrates a high level flow diagram of the data cleansing process.

Use the Informatica Advanced External Transformation process to interface with the FirstLogic module by creating a “Matching Link” transformation. That process uses the Informatica Transformation Developer to create a new Advanced External Transformation, which incorporates the properties of the FirstLogic Matching Link files. Once a Matching Link transformation has been created in the Transformation Developer, users can incorporate that transformation into any of their project mappings: it's reusable from the repository.

When an Informatica session starts, the transformation is initialized. The initialization sets up the address processing options, allocates memory, and opens the files for processing. This operation is only performed once. As each record is passed into the transformation it is parsed and standardized. Any output components are created and passed to the next transformation. When the session ends, the transformation is terminated. The memory is once again available and the directory files are closed.

The available functions / processes are as follows.

ACE Processing

There are four ACE transformations available to choose from. They will parse, standardize and append address components using FirstLogic’s ACE Library. The transformation choice depends on the input record layout. A fourth transformation can provide optional components. This transformation must be attached to one of the three base transformations.

The four transforms are:

1. ACE_discrete - where the input address data is presented in discrete fields.
2. ACE_multiline - where the input address data is presented in multiple lines (1-6).
3. ACE_mixed - where the input data is presented with discrete city/state/zip and multiple address lines (1-6).
4. Optional transform – which is attached to one of the three base transforms and outputs the additional components of ACE for enhancement.


All records input into the ACE transformation are returned as output. ACE returns Error/Status Code information during the processing of each address. This allows the end user to invoke additional rules before the final load is completed.

TrueName Process

TrueName mirrors the ACE transformation options with discrete, multi-line and mixed transformations. A fourth and optional transformation available in this process can be attached to one of the three transformations to provide genderization and match standards enhancements. TrueName will generate error and status codes. Similar to ACE, all records entered as input into the TrueName transformation can be used as output.

Matching Process

The matching process works through one transformation within the Informatica architecture. The input data is read into the Informatica data flow similar to a batch file. All records are read, the break groups are created and, in the last step, matches are identified. Users set up their own matching transformation through the PowerCenter Designer by creating an advanced external procedure transformation. Users are able to select which records are output from the matching transformations by editing the initialization properties of the transformation.

All matching routines are predefined and, if necessary, the configuration files can be accessed for additional tuning. The five predefined matching scenarios are: individual, family, household (the only difference between household and family is that household does not match on last name), firm individual, and firm. Keep in mind that the matching does not do any data parsing; parsing must be accomplished prior to using this transformation. As with ACE and TrueName, error and status codes are reported.

Trillium

Integration to Trillium’s data cleansing software is achieved through the Informatica Trillium Advanced External Procedures (AEP) interface. The AEP modules incorporate the following Trillium functional components.
• Trillium Converter – The Trillium Converter facilitates data conversion such as EBCDIC to ASCII, integer to character, character length modification, literal constant and increasing values. It may also be used to create unique record identifiers, omit unwanted punctuation, or translate strings based on actual data or mask values. A user-customizable parameter file drives the conversion process. The Trillium Converter is a separate transformation that can be used standalone or in conjunction with the Trillium Parser module.
• Trillium Parser – The Trillium Parser identifies and/or verifies the components of free-floating or fixed field name and address data. The primary function of the Parser is to partition the input address records into manageable components in preparation for postal and census geocoding. The parsing process is highly table-driven to allow for customization of name and address identification to specific requirements.
• Trillium Postal Geocoder – The Trillium Postal Geocoder matches an address database to the ZIP+4 database of the U.S. Postal Service (USPS).
• Trillium Census Geocoder – The Trillium Census Geocoder matches the address database to U.S. Census Bureau information.

Each record that passes through the Trillium Parser external module is first parsed and then, optionally, postal geocoded and census geocoded. The level of geocoding performed is determined by a user-definable initialization property.

• Trillium Window Matcher – The Trillium Window Matcher allows the PowerCenter Server to invoke Trillium’s deduplication and householding functionality. The Window Matcher is a flexible tool designed to compare records to determine the level of likeness between them. The result of each comparison is classified as a passed, a suspect, or a failed match, depending upon the likeness of data elements in each record as well as a scoring of their exceptions.

Input to the Trillium Window Matcher transformation is typically the sorted output of the Trillium Parser transformation. The options for sorting include:

• Using the Informatica Aggregator transformation as a sort engine.
• Separate the mappings whenever a sort is required. The sort can be run as a pre/post session command between mappings. Pre/post sessions are configured in the Server Manager.
• Build a custom AEP Transformation to include in the mapping.


Data Connectivity Using PowerConnect for BW Integration Server

Challenge

Understanding how to use PCISBW to load data into SAP BW.

Description

PowerCenter supports SAP Business Information Warehouse (BW) as a warehouse target only. PowerCenter Integration Server for BW enables you to include SAP Business Information Warehouse targets in your data mart or data warehouse. PowerCenter uses SAP’s Business Application Program Interface (BAPI), SAP’s strategic technology for linking components into the Business Framework, to exchange metadata with BW.

Key Differences of Using PowerCenter to Populate BW Instead of a RDBMS

• BW uses the pull model. BW must request data from an external source system, which is PowerCenter, before the source system can send data to BW. PowerCenter uses PCISBW to register with BW first, using SAP’s Remote Function Call (RFC) protocol.
• External source systems provide transfer structures to BW. Data is moved and transformed within BW from one or more transfer structures to a communication structure according to transfer rules. Both transfer structures and transfer rules must be defined in BW prior to use. Normally this is done from the BW side. An InfoCube is updated by one communication structure as defined by the update rules.
• Staging BAPIs (an API published and supported by SAP) are the native interface to communicate with BW. Three PowerCenter product suites use this API. PowerCenter Designer uses the Staging BAPIs to import metadata for the target transfer structures. PCISBW uses the Staging BAPIs to register with BW and receive requests to run sessions. PowerCenter Server uses the Staging BAPIs to perform metadata verification and load data into BW.
• Programs communicating with BW use the SAP standard saprfc.ini file to communicate with BW. The saprfc.ini file is similar to the tnsnames file in Oracle or the interface file in Sybase. The PowerCenter Designer reads metadata from BW and the PowerCenter Server writes data to BW.


• BW requires that all metadata extensions be defined in the BW Administrator Workbench. The definition must be imported to Designer. An active structure is the target for PowerCenter mappings loading BW.
• Due to its use of the pull model, BW must control all scheduling. BW invokes the PowerCenter session when the InfoPackage is scheduled to run in BW.
• BW only supports insertion of data into BW. There is no concept of update or deletes through the staging BAPIs.
• BW supports two different methods for loading data: IDOC and TRFC (Transactional Remote Functional Call). The methods have to be chosen in BW. When using IDOC, all of the processing required to move data from a transfer structure to an InfoCube (transfer structure to transfer rules to communication structure to update rules to InfoCubes) is done synchronously with the InfoPackage. When using the TRFC method, you have four options for the data target when you execute the InfoPackage: 1) InfoCubes only, 2) ODS only, 3) InfoCubes then ODS and 4) InfoCubes and ODS in parallel. Loading into the ODS is the fastest since less processing is performed on the data as it is being loaded into BW. (Lots of customers choose this option.) You can update the InfoCubes later.

Key Steps To Load Data Into BW

1. Install and Configure PowerCenter and PCISBW Components
The PCISBW server must be installed in the same directory as the PowerCenter Server. On NT you can have only one PCISBW. Informatica recommends installing PCISBW client tools in the same directory as the PowerCenter Client. For more details on installation and configuration refer to the Installation Guide.

2. Build the BW Components
Step 1: Create an External Source System
Step 2: Create an InfoSource
Step 3: Assign an External Source System
Step 4: Activate the InfoSources
Hint: You do not normally need to create an external Source System or an InfoSource. The BW administrator or project manager should tell you the name of the external source system and the InfoSource targets.

3. Configure the saprfc.ini file
Required for PowerCenter and PCISBW to connect to BW. You need the same saprfc.ini on both the PowerCenter Server and the PowerCenter Client.

4. Start the PCISBW server
Start the PCISBW server only after you start the PowerCenter server and before you create the InfoPackage in BW.
Pmbwserver [DEST_Entry_for_R_type] [repo_user][repo_passwd][port_for_PowerCenter_Server]
Note: The & sign behind the start command doesn’t work when you start up the PCISBW in a Telnet session.

5. Build mappings
Import the InfoSource into PowerCenter Warehouse Designer and build a mapping using the InfoSource as a target. Use the DEST_for_A_type as connect string.

6. Create a Database connection
Use the DEST entry_for A_type of the saprfc.ini as the connect string in the PowerCenter Server Manager.

7. Load data
Create a session in PowerCenter and an InfoPackage in BW. You can only start a Session from BW (Scheduler in the Administrator Workbench of BW). Before you can start a session, you have to enter the session_name into BW. To do this, open the Scheduler dialog box, go to the “Selection 3rd Party” tab and click on the “Selection Refresh” button (symbol is a recycling sign), which then prompts you for the session name. To start the session go to the last tab.

Parameter and Connection information file - Saprfc.ini

PowerCenter uses two types of entries to connect to BW through the saprfc.ini file:

• Type A. Used by PowerCenter Client and PowerCenter Server. Specifies the BW application server. The client uses Type A for importing the transfer structure (table definition) from BW into the Designer. The Server uses Type A for verifying the tables and writing into BW.
• Type R. Used by the PowerCenter Integration Server for BW. Registers the PCISBW as an RFC server at the SAP gateway so it acts as a listener. It then can receive the request from BW to run a session on PowerCenter Server.
• Do not use Notepad to edit this file. Notepad can corrupt the saprfc.ini file.

Set the RFC_INI environment variable for all Windows NT, Windows 2000 and Windows 95/98 machines equal to saprfc.ini. RFC_INI is used to locate the saprfc.ini file.

Restrictions on Mappings with BW InfoSource Targets

• You cannot use BW as a lookup table.
• You can use only one transfer structure for each mapping.
• You cannot execute stored procedure in a BW target.
• You cannot partition pipelines with a BW target.
• You cannot copy fields that are prefaced with /BIC/ from the InfoSource definition into other transformations.
• You cannot build update strategy in a mapping. BW supports only inserts. It does not support updates or deletes. You can use Update Strategy transformation in a mapping, but the PCISBW Server attempts to insert all records, even those marked for update or delete.

Error Messages

PCISBW writes error messages to the screen. In some cases PCISBW will generate a file with extension *.trc in the PowerCenter Server directory. Look for error messages there.
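As a rough illustration of the two saprfc.ini entry types described above, the Python sketch below parses saprfc.ini-style text and checks that it defines at least one Type A and one Type R destination. The destination names, host names, and program ID in the sample are invented placeholders. The keyword names (DEST, TYPE, ASHOST, SYSNR, PROGID, GWHOST, GWSERV) follow the usual SAP RFC library conventions, and the comment-line handling is an assumption; confirm both against your SAP documentation.

```python
# Placeholder content only; real values come from the BW administrator.
SAMPLE_SAPRFC_INI = """\
DEST=BW_APP_SERVER
TYPE=A
ASHOST=bwhost.example.com
SYSNR=00

DEST=PCISBW_LISTENER
TYPE=R
PROGID=PCISBW_LISTENER
GWHOST=bwhost.example.com
GWSERV=sapgw00
"""

def parse_saprfc_entries(text):
    """Split saprfc.ini-style text into one dict per DEST entry."""
    entries = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, lines without key=value, and lines starting
        # with "/" (assumed here to be comment lines).
        if not line or line.startswith("/") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key.upper() == "DEST":
            entries.append({})  # each DEST line starts a new entry
        if entries:
            entries[-1][key.upper()] = value
    return entries

def missing_entry_types(entries):
    """Return the entry types (A, R) that the file does not define."""
    present = {entry.get("TYPE", "").upper() for entry in entries}
    return sorted({"A", "R"} - present)

entries = parse_saprfc_entries(SAMPLE_SAPRFC_INI)
for entry in entries:
    print(entry["DEST"], "-> type", entry["TYPE"])
print("missing types:", missing_entry_types(entries))
```

A check like this can be run on both the PowerCenter Server and PowerCenter Client machines to confirm that they carry the same entries before PCISBW is started.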

Data Connectivity using PowerConnect for Mainframe

Challenge

Accessing important, but difficult to deal with, legacy data sources residing on mainframes and AS/400 systems, without having to write complex extract programs.

Description

When integrated with PowerCenter, PowerConnect for Mainframe and AS400 provides fast and seamless SQL access to non-relational sources, such as VSAM, flat files, ADABAS, IMS and IDMS, as well as to relational sources, such as DB2, via “datamaps”. It is an agent-based piece of software infrastructure that must be installed on OS/390 or AS/400 as either a regular batch job or started task. In addition, the PowerConnect client agent must be installed on the same machine as the PowerCenter client or server. The PowerConnect client agent and PowerCenter communicate via a thin ODBC layer, so that as far as PowerCenter is concerned, the mainframe or AS400 data is just a regular ODBC data source. The ODBC layer works for both Windows and UNIX. The PowerConnect client agent and listener work in tandem and, using TCP/IP, move the data at high-speed between the two platforms in either direction. The data can also be compressed and encrypted as it is being moved.

PowerConnect for Mainframe/AS400 has a Windows design tool, called Navigator, which can directly import the following information, without using FTP:

• COBOL and PL/1 copybooks
• Database definitions (DBDs) for IMS
• Subschemas for IDMS
• FDTs, DDMs, PREDICT data and ADA-CMP data for ADABAS
• Physical file definitions (DDS’s) for AS/400

After the above information has been imported and saved in the datamaps, PowerCenter uses SQL to access the data – which it sees as relational tables at runtime.

Some of the key capabilities of PowerConnect for Mainframe/AS400 include:

• Full EBCDIC-ASCII conversion
• Multiple concurrent data movements
• Support of all binary mainframe datatypes (e.g., packed decimal)
• Ability to handle complex data structures, such as COBOL OCCURS, OCCURS DEPENDING ON, ADABAS MU and PE
• Support for REDEFINES
• Date/time field masking
• Multiple views from single data source
• Bad data checking
• Data filtering

Steps for Using the Navigator

If your objective is to import a COBOL copybook from OS/390, the process is as follows:

1. Create the datamap (give it a name).
2. Specify the copybook name to be imported. This is the physical view.
3. Run the import process. A relational table is created. This is the logical view.
4. Review and edit (if necessary) the default table created.
5. Perform a “row test” to source the data directly from OS/390.

The datamap is stored on the mainframe.

Installing PowerConnect for Mainframe/AS400

Note: Be sure to complete the Pre-Install Checklist (included at the end of this document) prior to performing the install.

1. Perform the Windows install. This includes entering the Windows license key, updating the configuration file (dbmover.cfg) to add a node entry for communication between the client and the mainframe or AS/400, adding the PowerConnect ODBC driver and setting up a client ODBC DSN.
2. Perform the mainframe or AS/400 install. This includes entering the mainframe or AS/400 license key and updating the configuration file (dbmover.cfg) to change various default settings.
3. Start the Listener on the mainframe or the AS/400 system.
4. Ping the mainframe or AS/400 from Windows to ensure connectivity.
5. Access sample data in Navigator as a test.
6. Perform the UNIX or NT install. This includes entering the UNIX or NT license key, updating the configuration file (dbmover.cfg) to change various default settings, adding the PowerConnect ODBC driver and setting up the server ODBC DSN.

Guidelines for Integrating PowerConnect for Mainframe/AS400 with PowerCenter

• In Server Manager, a database connection is required to allow the server to communicate with PowerConnect. This should be of type ODBC. The DSN name and connect string should be the same as PowerConnect’s ODBC DSN, which was created when PowerConnect was installed.
• Since the Informatica server communicates with PowerConnect via ODBC, an ODBC license key is required.
• The “import from database” option in Designer is needed to pull in sources from PowerConnect, along with the PowerConnect ODBC DSN that was created when PowerConnect was installed.
• In Designer, before importing a source from PowerConnect for the first time, edit the powermrt.ini file by adding this entry at the end of the ODBCDLL section: DETAIL=EXTODBC.DLL
• When creating sessions in the Server Manager, modify the Tablename prefix in the Source Options to include the PowerConnect high-level qualifier (schema name).
• If entering a custom SQL override in the Source Qualifier to filter PowerConnect data, the statement must be qualified with the PowerConnect high-level qualifier (schema name).
• To handle large data sources, increase the default TIMEOUT setting in the PowerConnect configuration files (dbmover.cfg) to (15,1800,1800).
• To ensure smooth integration, apply the PowerCenter-PowerConnect for Mainframe/AS400 ODBC EBF.

Queue Manager • • • Informatica connects to Queue Manager to send and receive messages. Applications can also request data using a ‘request message’ on a message queue. creates queues.Data Connectivity using PowerConnect for MQSeries Challenge Understanding how to use MQSeries Applications in PowerCenter mappings. Joiners. and controls queue operation. MQSeries enforces No Structure on the content or format of the message. and Rank transformations because they will only be performed on one queue. MQSeries Message has two components: PAGE BP-36 BEST PRACTICES INFORMATICA CONFIDENTIAL . Description MQSeries Applications communicate by sending each other messages rather than calling each other directly. Message Queue is a destination to which messages can be sent. they can run independently of one another. You must use actual server manager session to debug a queue mapping. this is defined by the application. Because no open connections are needed between systems. as opposed to a full data set. Certain considerations also necessary when using Aggregators. MQSeries Architecture MQSeries architecture has three parts: (1) Queue Manager. Not Available to PowerCenter when using MQSeries • • • No Lookup on MQSeries sources. Every message queue belongs to a Queue Manager. (2) Message Queue and (3) MQSeries Message. Queue Manager administers queues. No Debug ‘Sessions’.

MQ SQ is predefined and comes with 29 message headed fields. XML. and control syncpoint queue clean-up. Flat File) or Normalizer (COBOL) is required if the data is not in binary.this is necessary if the file is not binary. Design the mapping as if it were not using MQ Series. • Static MQ Targets – Does not load data to the message header fields. COBOL). used for binary. MQ SQ – Must be used to read data from an MQ source.. control end of file. Dynamic – Used for binary targets only and when loading data to a message header. etc.• • A header. INFORMATICA CONFIDENTIAL BEST PRACTICES PAGE BP-37 .e. flat file. Once the code is working correctly. Loading to a Queue There are two types of MQ Targets that can be used in a mapping: Static MQ Targets and Dynamic MQ Targets. • Creating and Configuring MQSeries Sessions After you create mappings in the Designer. Flat File or Binary. Set Message Data Size – default 64. design the mapping as if it were not using MQ Series. then make all adjustments in the session when using MQ Series. (??CORRECT INTERPRETATION??) Use the target definition specific to the format of the message data (i. test by actually pulling data from the queue. If an Associated SQ is used. then add the MQ Source and Source Qualifier after the mapping logic has been tested. control incremental extraction. which contains the application data or the ‘message body. you can create and configure sessions in the Server Manager. A data component.000. Only one type of MQ Target can be used in a single mapping. XML. Use mapping parameters and variables Associated SQ – either an Associated SQ (XML. Filter Data – set filter conditions to filter messages using message header ports. You can create a session with an MQSeries mapping using the Session Wizard in the Server Manager. which contains data about the queue. MQ SQ can perform the following tasks: • • • • • Select Associated Source Qualifier . 
When extracting from a queue you need to use either of two Source Qualifiers: MQ Source Qualifier (MQ SQ) or Associated Source Qualifier (SQ).verbose. normal. the queue must be in a form of COBOL.’ Extraction from a Queue In order for PowerCenter to extract from a queue. You cannot use a MQ SQ to join two MQ sources. MSGID is the primary key. Set Tracing Level . Note that certain message headers in a MQSeries message require a predefined set of values assigned by IBM.

Configuring MQSeries Sources
MQSeries mappings cannot be partitioned if an associated source qualifier is used. For MQ Series sources, the Source Type is set to the following:
• Heterogeneous when there is an associated source definition in the mapping. This indicates that the source data is coming from an MQ source, and the message data is in flat file, COBOL or XML format.
• Message Queue when there is no associated source definition in the mapping.
Note that there are two pages on the Source Options dialog: XML and MQSeries. You can alternate between the two pages to set configurations for each.
• On the MQSeries page, select the MQ connection to use for the source message queue, and click OK.

Configuring MQSeries Targets
For Static MQSeries Targets, select File Target type from the list. When the target is an XML file or XML message data for a target message queue, the target type is automatically set to XML.
• If you load data to a dynamic MQ target, the target type is automatically set to Message Queue.
• Be sure to select the MQ checkbox in Target Options for the Associated file type. Once this is done, click Edit Object Properties and enter:
  • The Connection name of the target message Queue.
  • The Format of the Message Data in the Target Queue (ex. MQSTR).
  • The number of rows per message (only applies to flat file MQ Targets).

Appendix Information
PowerCenter uses the following datatypes in MQSeries mappings:
• IBM MQSeries datatypes. IBM MQSeries datatypes appear in the MQSeries source and target definitions in a mapping.
• Native datatypes. Flat file, XML, or COBOL datatypes associated with an MQSeries message data. Native datatypes appear in flat file, XML and COBOL source definitions. Native datatypes also appear in flat file and XML target definitions in the mapping.
• Transformation datatypes. Transformation datatypes are generic datatypes that PowerCenter uses during the transformation process. They appear in all the transformations in the mapping.

IBM MQSeries Datatypes

MQSeries Datatypes    Transformation Datatypes
MQBYTE                BINARY
MQCHAR                STRING
MQLONG                INTEGER

Data Connectivity using PowerConnect for PeopleSoft

Challenge
To maintain data integrity by sourcing/targeting transactional PeopleSoft systems, to maintain consistent, reusable metadata across various systems and to understand the process for extracting data and metadata from PeopleSoft sources without having to write and sustain complex SQR extract programs.

Description
PowerConnect for PeopleSoft supports extraction from PeopleSoft systems. PeopleSoft saves metadata in tables that provide a description and logical view of data stored in the underlying physical database tables. PowerConnect for PeopleSoft uses SQL to communicate with the database server. PowerConnect for PeopleSoft:
• Imports PeopleSoft source definition metadata via PowerCenter Designer using ODBC to connect to PeopleSoft tables.
• Extracts data during a session by directly running against the physical database tables using PowerCenter server.
• Extracts data from PeopleSoft systems without compromising existing PeopleSoft security features.

Installing PowerConnect for PeopleSoft
Installation of PowerConnect for PeopleSoft is a multi-step process. To begin, both the PowerCenter Client and Server have to be set up and configured. Certain drivers that enable PowerCenter to extract source data from PeopleSoft systems also need to be installed. The overall process involves:

Installing PowerConnect for PeopleSoft for the PowerCenter Server:
• Installation is simple like other Informatica products. Log onto the Server machine on Windows NT/2000 or UNIX and run the setup program to select and install the PowerConnect for PeopleSoft Server.
• On UNIX, make sure to set up the PATH environment variable to include the current directory.

Installing PowerConnect for PeopleSoft for the PowerCenter Client:
• Run the setup program and select PowerConnect for PeopleSoft client from the setup list.
• Client installation wizard points to the PowerCenter Client directory for the driver installation as a default, with the option to change the location.

Importing Sources
PowerConnect for PeopleSoft aids data integrity by sourcing/targeting transactional PeopleSoft systems and by maintaining reusable consistent metadata across various systems. While importing the PeopleSoft objects, PowerConnect for PeopleSoft also imports the metadata attached to those PeopleSoft structures. PowerConnect for PeopleSoft extracts source data from two types of PeopleSoft objects:
• Records
• Trees

PeopleSoft Records
A PeopleSoft record is a table-like structure that contains columns with defined datatypes, precision, scale and keys. PowerConnect for PeopleSoft helps in importing from the following PeopleSoft records.
• SQL table. Has one-to-one relationship with underlying physical tables.
• SQL view. Provides an alternative view of information in one or more database tables. Key columns contain duplicate values.
PeopleSoft names the underlying database tables after the records, PS_Record_Name. For example, data for the PeopleSoft record AE_REQUEST is saved in the PS_AE_REQUEST database table. When you import a PeopleSoft record, the Designer imports both the PeopleSoft source name and the underlying database table name. The Designer uses the PS source name as the name of the source definition. The PowerCenter Server uses the underlying database table name to extract source data.

PeopleSoft Trees
A PeopleSoft tree is an object that defines the groupings and hierarchical relationships between the values of a database field. A tree defines the summarization rules for a database field. It specifies how the values of a database field are grouped together for purposes of reporting or for security access. For example, the values of the DEPTID field identify individual departments in your organization. You can use the Tree Manager to define the organizational hierarchy that specifies how each department relates to the other departments. For example, departments 10700 and 10800 report to the same manager, department 20200 is part of a different division, and so on. In other words, you build a tree that mirrors the hierarchy.

Types of Trees
The Tree Manager enables you to create many kinds of trees for a variety of purposes, but all trees fall into these major types:
• Detail trees, in which database field values appear as detail values.
• Summary trees, which provide an alternative way to group nodes from an existing detail tree, without duplicating the entire tree structure.
• Node-oriented trees, in which database field values appear as tree nodes.
• Query access trees, which organize record definitions for PeopleSoft Query security.

PowerConnect for PeopleSoft extracts data from the following PeopleSoft tree structure types:

Detail Trees: In the most basic type of tree, the "lowest" level is the level farthest to the right in the Tree Manager window, and holds detail values. The next level is made up of tree nodes that group together the detail values, and each subsequent level defines a higher level grouping of the tree nodes. This kind of tree is called a detail tree. PowerConnect for PeopleSoft extracts data from loose-level and strict-level detail trees with static detail ranges.

Winter Trees: PowerConnect for PeopleSoft extracts data from loose-level and strict-level node-oriented trees. Winter trees contain no detail ranges.

Summary Trees: In a summary tree, the detail values aren't values from a database field, but tree nodes from an existing detail tree. The tree groups the nodes from a specific level in the detail tree differently from the higher levels in the detail tree itself. PowerConnect for PeopleSoft extracts data from loose-level and strict-level summary trees.

Node Oriented trees: In a node-oriented tree, the tree nodes represent the data values from the database field. The Departmental Security tree in PeopleSoft HRMS is a good example of a node-oriented tree.

Query access trees: These are used to maintain security within the PeopleSoft implementation. PeopleSoft records are grouped into logical groups, which are represented as nodes on the tree. This way, a query written by a certain logged in user within a group can only access the rows that are part of the records that are assigned to the group the user has access to. There are no branches in query trees, but children can/do exist.

Flattening trees
When you extract data from a PeopleSoft tree, the PowerCenter Server denormalizes the tree structure. It uses either of the following methods to denormalize trees.

• Horizontal flattening: The PowerCenter Server creates a single row for each final branch node or detail range in the tree. You can only use horizontal flattening with strict-level trees.
• Vertical flattening: The PowerCenter Server creates a row for each node or detail range represented in the tree. You can use vertical flattening with both strict-level and loose-level trees.

Tree Levels          Flattening Method    Tree Structure                      Metadata Extraction Method
Strict-level tree    Horizontal           Detail, Winter and Summary Trees    Import Source definition
Strict-level tree    Vertical             Detail, Winter and Summary Trees    Create Source definition
Loose-level tree     Vertical only        Detail, Winter and Summary Trees    Create Source definition

Extracting Data from PeopleSoft
PowerConnect for PeopleSoft extracts data from PeopleSoft systems without compromising existing PeopleSoft security. To access PeopleSoft metadata and data, PowerCenter Client and Server require a database username and password. You can either create separate users for metadata and source extraction or alternatively use one for both.

Extracting data from PeopleSoft is a three-step process:
1. Import or create source definition
2. Create mapping
3. Create and run a session

1. Importing or Creating Source Definitions
Before extracting data from a source, you need to import its source definition. You need a user with read access to PeopleSoft system to access the PeopleSoft physical and metadata tables via an ODBC connection. To import a PeopleSoft source definition, create an ODBC data source for each PeopleSoft system you want to access. When creating an ODBC data source, configure the data source to connect to the underlying database for the PeopleSoft system. For example, if PeopleSoft system resides on Oracle database, configure an ODBC data source to connect to the Oracle database. You can use the database system names for ODBC names.

Note: If PeopleSoft already establishes database connection names, use the PeopleSoft database connection names.

Use the Sources-Import command in PowerCenter Designer’s Source Analyzer tools to import PeopleSoft records and strict-level trees.
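As a rough illustration of the two flattening methods described earlier in this section, the following Python sketch flattens a small strict-level tree both ways. It is not PowerCenter code: the Node class, the function names, the row layouts, and the division and manager node names are assumptions made for the example. Only the department numbers come from the DEPTID example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical tree node; leaves stand in for detail values."""
    name: str
    children: List["Node"] = field(default_factory=list)


def flatten_horizontal(root: Node) -> List[Tuple[str, ...]]:
    """Return one row per leaf, with one column per tree level."""
    rows: List[Tuple[str, ...]] = []

    def walk(node: Node, path: Tuple[str, ...]) -> None:
        path = path + (node.name,)
        if not node.children:
            rows.append(path)
        for child in node.children:
            walk(child, path)

    walk(root, ())
    return rows


def flatten_vertical(root: Node) -> List[Tuple[str, Optional[str], int]]:
    """Return one row per node as (node, parent, level)."""
    rows: List[Tuple[str, Optional[str], int]] = []

    def walk(node: Node, parent: Optional[str], level: int) -> None:
        rows.append((node.name, parent, level))
        for child in node.children:
            walk(child, node.name, level + 1)

    walk(root, None, 1)
    return rows


# Illustrative department tree (division and manager names are made up).
tree = Node("COMPANY", [
    Node("DIVISION_A", [Node("MANAGER_1", [Node("10700"), Node("10800")])]),
    Node("DIVISION_B", [Node("MANAGER_2", [Node("20200")])]),
])

horizontal_rows = flatten_horizontal(tree)
vertical_rows = flatten_vertical(tree)
```

In this sketch every horizontal row has the same number of columns only because every leaf sits at the same depth, which is one way to see why horizontal flattening is restricted to strict-level trees while vertical flattening also works for loose-level trees.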

Importing Records
You can import records from two tabs in the Import from PeopleSoft dialog box:
• Records tab.
• Panels tab. PowerConnect for PeopleSoft uses the Panels tab to import PeopleSoft 8 Pages. In PeopleSoft 8, Panels are referred to as Pages.

Note: PowerConnect for PeopleSoft works with all versions of PeopleSoft systems.

PeopleTools based applications are table-based systems. A database for a PeopleTools application contains three major sets of tables:
• System Catalog Tables store physical attributes of tables and views, which your database management system uses to optimize performance.
• PeopleTools Tables contain information that you define using PeopleTools.
• Application Data Tables house the actual data your users will enter and access through PeopleSoft application windows and panels.

After you import or create a PeopleSoft record or tree, the Navigator displays and organizes sources by the PeopleSoft record or tree name by default.

2. Create a Mapping
After you import or create the source definition, you connect to an ERP Source Qualifier to represent the records the PowerCenter Server queries from a PeopleSoft source. An ERP Source Qualifier is used for all ERP sources like SAP, PeopleSoft etc. An ERP Source Qualifier, like the Source Qualifier, allows you to use user-defined joins and filters. When using the default join option between two PeopleSoft tables, the query created will automatically prepend a PS_ prefix to the PeopleSoft table names. However, there are certain tables that are stored on the database without that prefix, so an override and a user-defined join will need to be made to correct this.

Take care when using user-defined primary-foreign key relationships with trees, since changes made within Tree Manager may alter such relationships. Denormalization of the tables that made up the tree will be changed, so simply altering the primary-foreign key relationship within Source Analyzer can be dangerous and it is advisable to re-import the whole tree.

3. Creating and Running a Session
You need a valid mapping, a registered PowerCenter Server, and a Server Manager database connection to create a session. When you configure the session, select PeopleSoft as the source database type and then select a PeopleSoft database connection as source database. If the database user is not the owner of the source tables, enter the table owner name in the session as a source table prefix.

Note: If the mapping contains a Source or ERP Qualifier with a SQL Override, the PowerCenter Server ignores the table name prefix setting for all connected sources.

PowerCenter uses SQL to extract data directly from the physical database tables, performing code page translations when necessary. If you need to extract a large amount of source data, you can partition the sources to improve session performance.

Note: You cannot partition an ERP Source Qualifier for PeopleSoft when it is connected to or associated with a PeopleSoft tree.
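The naming rules mentioned in this section (a PS_ prefix on the record name, an optional table owner prefix, and some tables stored without PS_) can be summarized in a short Python sketch. It only illustrates the convention. The function name, the owner value, and the exception list are hypothetical and do not describe how PowerCenter builds its queries internally.

```python
from typing import FrozenSet, Optional

# Hypothetical placeholder for records whose tables are stored without PS_.
UNPREFIXED_RECORDS: FrozenSet[str] = frozenset({"EXAMPLE_UNPREFIXED"})


def physical_table_name(record_name: str,
                        owner: Optional[str] = None,
                        unprefixed: FrozenSet[str] = UNPREFIXED_RECORDS) -> str:
    """Derive the database table name for a PeopleSoft record name."""
    # Most records map to PS_<record name>; listed exceptions keep their name.
    table = record_name if record_name in unprefixed else "PS_" + record_name
    # Qualify with the table owner when the session user does not own the table.
    return f"{owner}.{table}" if owner else table
```

For the record named in the text, physical_table_name("AE_REQUEST") gives PS_AE_REQUEST. Passing an owner such as the made-up "HR_OWNER" gives HR_OWNER.PS_AE_REQUEST, mirroring the source table prefix set in the session.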

Data Connectivity using PowerConnect for SAP

Challenge
Understanding how to install PowerConnect for SAP R/3, extract data from SAP R/3, build mappings, and run sessions to load SAP R/3 data into a data warehouse.

Description
SAP R/3 is a software system that integrates multiple business applications, such as Financial Accounting, Materials Management, Sales and Distribution, and Human Resources. The R/3 system is programmed in Advanced Business Application Programming-Fourth Generation (ABAP/4, or ABAP), a language proprietary to SAP. PowerConnect for SAP R/3 provides the ability to integrate SAP R/3 data into data warehouses, analytic applications, and other applications. All of this is accomplished without writing ABAP code. PowerConnect extracts data from transparent tables, pool tables, cluster tables, hierarchies (Uniform & Non Uniform), SAP IDOCs and ABAP function modules.

The database server stores the physical tables in the R/3 system, while the application server stores the logical tables. A transparent table definition on the application server is represented by a single physical table on the database server. Pool and cluster tables are logical definitions on the application server that do not have a one-to-one relationship with a physical table on the database server.

Communication Interfaces
TCP/IP is the native communication interface between PowerCenter and SAP R/3. Other interfaces between the two include:
• Common Program Interface-Communications (CPI-C). CPI-C communication protocol enables online data exchange and data conversion between R/3 system and PowerCenter. To initialize CPI-C communication with PowerCenter, SAP R/3 requires information such as the host name of the application server and SAP gateway. This information is stored on the PowerCenter Server in a configuration file named sideinfo. The PowerCenter server uses parameters in the sideinfo file to connect to R/3 system when running the stream mode sessions.
• Remote Function Call (RFC). RFC is the remote communication protocol used by SAP and is based on RPC (Remote Procedure Call). To execute remote calls from PowerCenter, SAP R/3 requires information such as the connection type, and the service name and gateway on the application server. This information is stored on the PowerCenter Client and PowerCenter Server in a configuration file named saprfc.ini. PowerCenter makes remote function calls when importing source definitions, installing ABAP program, and running file mode sessions.
• Transport system. The transport system in SAP is a mechanism to transfer objects developed on one system to another system. There are two situations when transport system is needed:
  • PowerConnect for SAP R/3 installation.
  • Transport ABAP programs from development to production.
Note: If the ABAP programs are installed in the $TMP class then they cannot be transported from development to production.

Extraction Process
R/3 source definitions can be imported from the logical tables using RFC protocol. Extracting data from R/3 is a four-step process:
1. Import source definitions. Designer connects to the R/3 application server using RFC. The Designer calls a function in the R/3 system to import source definitions.
2. Create a mapping. When creating a mapping using an R/3 source definition, you must use an ERP Source Qualifier. In the ERP Source Qualifier, you can customize properties of the ABAP program that the R/3 server uses to extract source data. You can also use joins, filters, ABAP program variables, ABAP code blocks, and SAP functions to customize the ABAP program.
3. Generate and install ABAP program. Two ABAP programs can be installed for each mapping:
  • File mode. Extract data to file. The PowerCenter Server accesses the file through FTP or NFS mount.
  • Stream Mode. Extract data to buffers. The PowerCenter server accesses the buffers through CPI-C, the SAP protocol for program-to-program communication.

4. Create and Run Session. (File or Stream mode)
• Stream Mode. In stream mode, the installed ABAP program creates buffers on the application server. The program extracts source data and loads it into the buffers. When a buffer fills, the program streams the data to the PowerCenter Server using CPI-C. With this method, PowerCenter Server can process data when it is received.
• File Mode. When running a session in file mode, the session must be configured to access the file through NFS mount or FTP. When the session runs, the installed ABAP program creates a file on the application server. The program extracts source data and loads it into the file. When the file is complete, the PowerCenter Server accesses the file through FTP or NFS mount and continues processing the session.

Installation and Configuration Steps

For SAP R/3
The R/3 system needs development objects and user profiles established to communicate with PowerCenter. PowerCenter calls these objects each time it makes a request to the R/3 system. Preparing R/3 for integration involves the following tasks:
• Transport the development objects on the PowerCenter CD to R/3.
• Run transport program that generates unique IDs.

• Establish profiles in the R/3 system for PowerCenter users.
• Create a development class for the ABAP programs that PowerCenter installs on the SAP R/3 system.

For PowerCenter
The PowerCenter Server and Client need drivers and connection files to communicate with SAP R/3. Preparing PowerCenter for integration involves the following tasks:
• Run installation programs on PowerCenter Server and Client machines.
• Configure the connection files:
  • The sideinfo file on the PowerCenter Server allows PowerCenter to initiate CPI-C with the R/3 system.
  • The saprfc.ini file on the PowerCenter Client and Server allows PowerCenter to connect to the R/3 system as an RFC client.

Required Parameters for sideinfo
• DEST – logical name of the R/3 system
• LU – host name of the SAP application server machine
• TP – set to sapdp<system number>
• GWHOST – host name of the SAP gateway machine.
• GWSERV – set to sapgw<system number>
• PROTOCOL – set to “I” for TCP/IP connection.

Required Parameters for saprfc.ini
• DEST – logical name of the R/3 system
• TYPE – set to “A” to indicate connection to specific R/3 system.
• ASHOST – host name of the SAP R/3 application server.
• SYSNR – system number of the SAP R/3 application server.

Configuring the Services File
On NT, it is located in \winnt\system32\drivers\etc
On UNIX, it is located in /etc
• sapdp<system number> <port# of dispatcher service>/TCP
• sapgw<system number> <port# of gateway service>/TCP
The system number and port numbers are provided by the BASIS administrator.

Configure Connections to run Sessions
Configure database connections in the Server Manager to access the SAP R/3 system when running a session. Configure FTP connection to access staging file through FTP.
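To make the required parameters above easier to picture, the Python sketch below assembles one sideinfo entry and one saprfc.ini entry for a single R/3 system. The KEY=value layout, the helper names, the logical name R3DEV, the host names, and system number 00 are all assumptions for illustration. Check the actual file syntax and values with the BASIS administrator and the product documentation.

```python
from typing import Dict


def sideinfo_entry(dest: str, app_host: str, gateway_host: str,
                   system_number: str) -> Dict[str, str]:
    """Collect the required sideinfo parameters for one R/3 system."""
    return {
        "DEST": dest,                       # logical name of the R/3 system
        "LU": app_host,                     # SAP application server host
        "TP": f"sapdp{system_number}",      # dispatcher service name
        "GWHOST": gateway_host,             # SAP gateway host
        "GWSERV": f"sapgw{system_number}",  # gateway service name
        "PROTOCOL": "I",                    # "I" selects TCP/IP
    }


def saprfc_entry(dest: str, app_host: str, system_number: str) -> Dict[str, str]:
    """Collect the required saprfc.ini parameters for one R/3 system."""
    return {
        "DEST": dest,
        "TYPE": "A",            # "A" = connection to a specific R/3 system
        "ASHOST": app_host,
        "SYSNR": system_number,
    }


def render(entry: Dict[str, str]) -> str:
    """Render an entry as KEY=value lines (assumed file layout)."""
    return "\n".join(f"{key}={value}" for key, value in entry.items())


# Hypothetical values for a development system.
sideinfo_text = render(sideinfo_entry("R3DEV", "sapapp01", "sapgw01", "00"))
saprfc_text = render(saprfc_entry("R3DEV", "sapapp01", "00"))
```

The sapdp and sapgw names built here are the same service names that the services file maps to port numbers, so the system number has to agree across all three files.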

Steps to Configure PowerConnect on PowerCenter
1. Install PowerConnect for SAP R/3 on PowerCenter.
2. Configure the sideinfo file.
3. Configure the saprfc.ini file.
4. Set the RFC_INI environment variable.
5. Configure the database connection to run session.
6. Configure the FTP connection to access staging files through FTP.

Key Capabilities of PowerConnect for SAP R/3
Some key capabilities of PowerConnect for SAP R/3 include:
• Import SAP function in the Source Analyzer.
• Import IDOCS.
• Insert ABAP Code Block to add more functionality to the ABAP program flow.
• Use of outer join when two or more sources are joined in the ERP Source Qualifier. For example: qualifying table = table1-field1 = table2-field2 where the qualifying table is the “last” table in the condition based on the join order.
• Use of static filters to reduce return rows. (MARA = MARA-MATNR = ‘189’)
• Customization of the ABAP program flow with joins, filters, SAP functions and code blocks.
• Creation of ABAP Program variables to represent SAP R/3 structures, structure fields or values in the ABAP program.
• Removal of ABAP program Information from SAP R/3 and the repository when a folder is deleted.

Be sure to note the following considerations regarding SAP R/3:
• You must have proper authorization on the R/3 system to perform integrated tasks. The R/3 administration needs to create authorization, profiles and userids for PowerCenter users.
• The R/3 system administrator must use the transport control program tp import, to transport these objects files on the R/3 system. The transport process creates a development class called ZERP. The installation CD includes devinit, dev3x, dev4x, production program files. To avoid problems extracting metadata, installing programs, and running sessions, do not install the dev3x transport on a 4.x system, or dev4x transport on a 3.x system.
• Do not use Notepad to edit saprfc.ini file. Use a text editor, such as WordPad.
• R/3 does not always maintain referential integrity between primary key and foreign key relationship. If you use R/3 source to create target definitions in the Warehouse Designer, you may encounter key constraint errors when you load the data warehouse. To avoid these errors, edit the keys in the target definition before you build the physical targets.
• If your mapping has hierarchy definitions only, you cannot install the ABAP program.

• Do not use the Select Distinct option for LCHR when the length is greater than 2000 and the underlying database is Oracle. This causes the session to fail.
• You cannot generate and install ABAP programs from mapping shortcuts.
• If a mapping contains both hierarchies and tables, you must generate the ABAP program using file mode.
• You cannot use an ABAP code block, an ABAP program variable and a source filter if the ABAP program flow contains a hierarchy and no other sources.
• You cannot use dynamic filters on IDOC source definitions in the ABAP program flow.
• SAP R/3 stores all CHAR data with trailing blanks. When the PowerCenter Server extracts CHAR data from SAP R/3, it treats it as VARCHAR data and trims the trailing blanks. The PowerCenter server also trims trailing blanks for CUKY and UNIT data. This allows you to compare R/3 data with other source data without having to use the RTRIM function. If you are upgrading and your mappings use the blanks to compare R/3 data with other data, you may not want the PowerCenter Server to trim the trailing blanks. To avoid trimming the trailing blanks, add the flag AllowTrailingBlanksForSAPCHAR=Yes in the pmserver.cfg file. If PowerCenter server is on NT/2000, you have to add that parameter as a string value to the key: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\PowerMart\Parameters\MiscInfo
• PowerCenter has the ability to generate the ABAP code for the mapping. When this ABAP code is generated, however, it does not automatically create a transport for the ABAP code that it just generated. The transport must be created manually within SAP and then transported to the Production environment.
• Given that the development and production SAP systems are identical, you should be able to just switch your mapping to point to either the development or production instance at the session level. So for migration purposes, depending on which environment you’re in, all you need to do is change the database connections at the session level.

Incremental Loads

Challenge
Data warehousing incorporates large volumes of data, making the process of loading into the warehouse without compromising its functionality increasingly difficult. The goal is to create a load strategy that will minimize downtime for the warehouse and allow quick and robust data management.

Description
As time windows shrink and data volumes increase, it is important to understand the impact of a suitable incremental load strategy. The design should allow data to be incrementally added to the data warehouse with minimal impact to the overall system. The following pages describe several possible load strategies.

Considerations
• Incremental Aggregation – loading deltas into an aggregate table.
• Error-un/loading data – strategies for recovering, reloading, and unloading data.
• History tracking – keeping track of what has been loaded and when.
• Slowly changing dimensions – Informatica Wizards for generic mappings (a good start to an incremental load strategy).

Source Analysis
Data sources typically fall into the following possible scenarios:
• Delta Records - Records supplied by the source system include only new or changed records. In this scenario, all records are generally inserted or updated into the data warehouse.
• Record Indicator or Flags - Records that include columns that specify the intention of the record to be populated into the warehouse. Records can be selected based upon this flag to allow for inserts, updates and deletes.
• Date stamped data - Data is organized by timestamps. Data will be loaded into the warehouse based upon the last processing date or the effective date range.

• Key values are present - When only key values are present, data must be checked against what has already been entered into the warehouse. All values must be checked before entering the warehouse.
• No Key values present - Surrogate keys will be created and all data will be inserted into the warehouse based upon validity of the records.

Identify Which Records Need to be Compared
Once the sources are identified, it is necessary to determine which records will be entered into the warehouse and how. Here are some considerations:
• Compare with the target table. Determine if the record exists in the target table. If the record does not exist, insert the record as a new row. If it does exist, determine if the record needs to be updated, inserted as a new record, or removed (deleted from target or filtered out and not added to the warehouse). This occurs in cases of delta loads, timestamps, keys or surrogate keys.
• Record indicators. Record indicators can be beneficial when lookups into the target are not necessary. Take care to ensure that the record exists for updates or deletes or the record can be successfully inserted. More design effort may be needed to manage errors in these situations.

Determine the Method of Comparison
1. Joins of Sources to Targets. Records are directly joined to the target using Source Qualifier join conditions or using joiner transformations after the source qualifiers (for heterogeneous sources). When using joiner transformations, take care to ensure the data volumes are manageable.
2. Lookup on target. Using the lookup transformation, lookup the keys or critical columns in the target relational database. Keep in mind the caches and indexing possibilities.
3. Load table log. Generate a log table of records that have been already inserted into the target system. You can use this table for comparison with lookups or joins, depending on the need and volume. For example, store keys in a separate table and compare source records against this log table to determine load strategy.

Source Based Load Strategies

Complete Incremental Loads in a Single File/Table
The simplest method of incremental loads is from flat files or a database in which all records will be loaded. This particular strategy requires bulk loads into the warehouse, with no overhead on processing of the sources or sorting the source records.

Loading Method
Data can be loaded directly from these locations into the data warehouse. There is no additional overhead produced in moving these sources into the warehouse.
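The compare-with-target decision described under "Identify Which Records Need to be Compared" can be sketched in a few lines of Python. This is an in-memory stand-in for a Lookup transformation against the target, not PowerCenter logic. The function name, the dictionary-based target, and the sample rows are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def classify_rows(source_rows: List[Row],
                  target_by_key: Dict[object, Row],
                  key: str = "id") -> Tuple[List[Row], List[Row], List[Row]]:
    """Split source rows into inserts, updates, and unchanged rows."""
    inserts: List[Row] = []
    updates: List[Row] = []
    unchanged: List[Row] = []
    for row in source_rows:
        existing = target_by_key.get(row[key])  # the "lookup on target"
        if existing is None:
            inserts.append(row)      # not in target: insert as a new row
        elif existing != row:
            updates.append(row)      # in target but different: update
        else:
            unchanged.append(row)    # identical: filter out
    return inserts, updates, unchanged


# Hypothetical target contents and incoming source rows.
target = {1: {"id": 1, "name": "A"}, 2: {"id": 2, "name": "B"}}
source = [{"id": 1, "name": "A"}, {"id": 2, "name": "B2"}, {"id": 3, "name": "C"}]
new_rows, changed_rows, same_rows = classify_rows(source, target)
```

Holding the whole target in a dictionary mirrors a fully cached lookup, which is why the text's reminder about caches and data volumes matters: for large targets a join or a log table may be the more practical comparison method.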

Date Stamped Data
This method involves data that has been stamped using effective dates or sequences. The incremental load can be determined by dates greater than the previous load date or data that has an effective key greater than the last key processed.

Loading Method
With the use of relational sources, the records can be selected based on this effective date and only those records past a certain date will be loaded into the warehouse. Views can also be created to perform the selection criteria so the processing will not have to be incorporated into the mappings. Placing the load strategy into the ETL component is much more flexible and controllable by the ETL developers and metadata.

Non-relational data can be filtered as records are loaded based upon the effective dates or sequenced keys. A router transformation or a filter can be placed after the source qualifier to remove old records.

To compare the effective dates, you can use mapping variables to provide the previous date processed. The alternative is to use control tables to store the date and update the control table after each load. For detailed instruction on how to select dates, refer to Best Practice: Variable and Mapping Parameters.

Changed Data based on Keys or Record Information
Data that is uniquely identified by keys can be selected based upon selection criteria. For example, records that contain key information such as primary keys, alternate keys etc. can be used to determine if they have already been entered into the data warehouse. If they exist, you can also check to see if you need to update these records or discard the source record.

Load Method
It may be possible to do a join with the target tables in which new data can be selected and loaded into the target. It may also be feasible to lookup in the target to see if the data exists or not.

Target Based Load Strategies

Load Directly into the Target
Loading directly into the target is possible when the data will be bulk loaded. The mapping will be responsible for error control, recovery and update strategy.

Load into Flat Files and Bulk Load using an External Loader

The mapping will load data directly into flat files. An external loader can be invoked at that point to bulk load the data into the target. This method reduces the load times (with less downtime for the data warehouse) and also provides a means of maintaining a history of data being loaded into the target. Typically this method is only used for updates into the warehouse.

Load into a Mirror Database
The data will be loaded into a mirror database to avoid down time of the active data warehouse. After data has been loaded, the databases are switched, making the mirror the active database and the active as the mirror.

Using Mapping Variables and Parameter Files
A mapping variable can be used to perform incremental loading. This is a very important issue that everyone should understand. The mapping variable is used in the join condition in order to select only the new data that has been entered based on the create_date or the modify_date, whichever date can be used to identify a newly inserted record. The source system must have a reliable date to use. Here are the steps involved in this method:

Step 1: Create Mapping Variable
In the Informatica Designer, with the mapping designer open, go to the menu and select Mappings, then select Parameters and Values. Name the variable and, in this case, make your variable a date/time. For the Aggregation option, select MAX. In the same screen, state your initial value. This is the date at which the load should start. The date must follow one of these formats:
• MM/DD/RR
• MM/DD/RR HH24:MI:SS
• MM/DD/YYYY
• MM/DD/YYYY HH24:MI:SS

Step 2: Use the Mapping Variable in the Source Qualifier
The select statement will look like the following:

    Select * from tableA Where CREATE_DATE > to_date('$$INCREMENT_DATE', 'MM-DD-YYYY HH24:MI:SS')

Step 3: Use the Mapping Variable in an Expression

In the expression create a variable port and use the SETMAXVARIABLE variable function and do the following: SETMAXVARIABLE($$INCREMENT_DATE.CREATE_DATE) CREATE_DATE is the date for which you would like to store the maximum value. PAGE BP-56 BEST PRACTICES INFORMATICA CONFIDENTIAL . You can view the value of the mapping variable in the session log file. You can use the variable functions in the following transformations: • • • • Expression Filter Router Update Strategy The variable constantly holds (per row) the max value between source and variable. that is the PERSISTENT value stored in the repository for the next run of your session. So if one row comes through with 9/1/2001. If all subsequent rows are LESS than that. The value of the mapping variable and incremental loading is that it allows the session to use only the new rows of data. No table is needed to store the max(date)since the variable takes care of it. then the variable gets that value. use an expression to work with the variable functions to set and use the mapping variable. After the mapping completes.For the purpose of this example. then 9/1/2001 is preserved.
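The behavior described in Steps 2 and 3 can be pictured with a small simulation. The sketch below is not PowerCenter code: the function name run_session, the row layout, and the sample dates are assumptions made for illustration. It only mimics the idea that each run selects rows newer than the stored variable and then persists the largest date seen.

```python
from datetime import datetime


def run_session(source_rows, increment_date):
    """Simulate one incremental session run (hypothetical helper).

    Returns the rows that would be loaded and the value that would be
    persisted for the next run.
    """
    # Step 2 analogue: WHERE CREATE_DATE > $$INCREMENT_DATE
    selected = [r for r in source_rows if r["create_date"] > increment_date]

    # Step 3 analogue: per row, keep the larger of the variable and the row date
    current = increment_date
    for row in selected:
        current = max(current, row["create_date"])
    return selected, current


# Sample data (invented for this sketch)
rows = [
    {"id": 1, "create_date": datetime(2001, 8, 30)},
    {"id": 2, "create_date": datetime(2001, 9, 1)},
    {"id": 3, "create_date": datetime(2001, 8, 31)},
]

# First run starts from the initial value entered in Step 1
first_load, saved = run_session(rows, datetime(2001, 8, 29))

# A new row arrives before the second run; only that row is selected
rows.append({"id": 4, "create_date": datetime(2001, 9, 2)})
second_load, saved_again = run_session(rows, saved)

print([r["id"] for r in first_load], saved)
print([r["id"] for r in second_load], saved_again)
```

In the simulation the second run picks up only the row added after the first run, which is the point of persisting the maximum date instead of keeping it in a separate control table.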

Mapping Design

Challenge

Use the PowerCenter tool suite to create an efficient execution environment.

Description

Although PowerCenter environments vary widely, most sessions and/or mappings can benefit from the implementation of common objects and optimization procedures. Follow these procedures and rules of thumb when creating mappings to help ensure optimization. Use mapplets to leverage the work of critical developers and minimize mistakes when performing similar functions.

General Suggestions for Optimizing

1. Reduce the number of transformations
• There is always overhead involved in moving data between transformations.
• Consider more shared memory for a large number of transformations. Session shared memory between 12 MB and 40 MB should suffice.

2. Calculate once, use many times
• Avoid calculating or testing the same value over and over.
• Calculate it once in an expression, and set a True/False flag.
• Within an expression, use variables to calculate a value used several times.

3. Only connect what is used
• Delete unnecessary links between transformations to minimize the amount of data moved, particularly in the Source Qualifier.
• This is also helpful for maintenance, if you exchange transformations (e.g., a Source Qualifier).

4. Watch the data types
• The engine automatically converts compatible types.
• Sometimes conversion is excessive, and happens on every transformation.
• Minimize data type changes between transformations by planning data flow prior to developing the mapping.

5. Facilitate reuse
• Plan for reusable transformations upfront.
• Use variables.
• Use mapplets to encapsulate multiple reusable transformations.

6. Only manipulate data that needs to be moved and transformed
• Delete unused ports, particularly in Source Qualifiers and Lookups. Reducing the number of records used throughout the mapping provides better performance.
• Use active transformations that reduce the number of records as early in the mapping as possible (i.e., place filters and aggregators as close to the source as possible).
• Select the appropriate driving/master table when using joins. The table with the smaller number of rows should be the driving/master table.

7. When DTM bottlenecks are identified and session optimization has not helped, use tracing levels to identify which transformation is causing the bottleneck (use the Test Load option in session properties).

8. Utilize single-pass reads
• Single-pass reading is the server’s ability to use one Source Qualifier to populate multiple targets.
• For any additional Source Qualifier, the server reads this source. If you have different Source Qualifiers for the same source (e.g., one for delete and one for update/insert), the server reads the source for each Source Qualifier.

9. Remove or reduce field-level stored procedures
• If you use field-level stored procedures, PowerMart has to make a call to that stored procedure for every row, so performance will be slow.

Lookup Transformation Optimizing Tips
• When your source is large, cache lookup table columns for those lookup tables of 500,000 rows or less. This typically improves performance by 10-20%.
• The rule of thumb is not to cache any table over 500,000 rows. This is only true if the standard row byte count is 1,024 or less. If the row byte count is more than 1,024, then the 500K rows will have to be adjusted down as the number of bytes increases (i.e., a 2,048 byte row can drop the cache row count to 250K – 300K, so the lookup table will not be cached in this case).
• When using a Lookup Table Transformation, improve lookup performance by placing all conditions that use the equality operator ‘=’ first in the list of conditions under the condition tab.
• Cache lookup tables only if the number of lookup calls is more than 10-20% of the lookup table rows. For a smaller number of lookup calls, do not cache if the number of lookup table rows is big. For small lookup tables, less than 5,000 rows, cache for more than 5-10 lookup calls.
• Replace lookup with decode or IIF (for small sets of values).
• If caching lookups and performance is poor, consider replacing with an unconnected, uncached lookup.

10. Review complex expressions.

11. Examine mappings via Repository Reporting.

12. Minimize aggregate function calls.

13. Replace the Aggregate Transformation object with an Expression Transformation object and an Update Strategy Transformation for certain types of aggregations.

Operations and Expression Optimizing Tips
• Numeric operations are faster than string operations.
• Optimize char-varchar comparisons (i.e., trim spaces before comparing).
• Operators are faster than functions (i.e., || vs. CONCAT).
• Optimize IIF expressions.
• Avoid date comparisons in lookup; replace with string.
• Test expression timing by replacing with a constant.

14. Use Flat Files
• Using flat files located on the server machine loads faster than a database located on the server machine.
• Fixed-width files are faster to load than delimited files because delimited files require extra parsing.
• If processing intricate transformations, consider first loading the source flat file into a relational database, which allows the PowerCenter mappings to access the data in an optimized fashion by using filters and custom SQL Selects where appropriate.

15. If working with data that is not able to return sorted data (e.g., Web Logs), consider using the Sorter Advanced External Procedure.

Suggestions for Using Mapplets

A mapplet is a reusable object that represents a set of transformations. It allows you to reuse transformation logic and can contain as many transformations as necessary.

1. Create a mapplet when you want to use a standardized set of transformation logic in several mappings. For example, if you have several fact tables that require a series of dimension keys, you can create a mapplet containing a series of Lookup transformations to find each dimension key. You can then use the mapplet in each fact table mapping, rather than recreate the same lookup logic in each mapping.

2. To create a mapplet, add, connect, and configure transformations to complete the desired transformation logic. After you save a mapplet, you can use it in a mapping to represent the transformations within the mapplet. When you use a mapplet in a mapping, you use an instance of the mapplet. All uses of a mapplet are tied to the ‘parent’ mapplet. Hence, all changes made to the parent mapplet logic are inherited by every ‘child’ instance of the mapplet. When the server runs a session using a mapplet, it expands the mapplet. The server then runs the session as it would any other session, passing data through each transformation in the mapplet as designed.

3. A mapplet can be active or passive depending on the transformations in the mapplet. Active mapplets contain at least one active transformation. Passive mapplets only contain passive transformations. Being aware of this property when using mapplets can save time when debugging invalid mappings.

4. There are several unsupported transformations that should not be used in a mapplet. These include: COBOL source definitions, joiner, normalizer, nonreusable sequence generator, pre- or post-session stored procedures, target definitions, and PowerMart 3.5 style lookup functions.

5. Do not reuse mapplets if you only need one or two transformations of the mapplet while all other calculated ports and transformations are obsolete.

6. Source data for a mapplet can originate from one of two places:
• Sources within the mapplet. Use one or more source definitions connected to a Source Qualifier or ERP Source Qualifier transformation. When you use the mapplet in a mapping, the mapplet provides source data for the mapping and is the first object in the mapping data flow.
• Sources outside the mapplet. Use a mapplet Input transformation to define input ports. When you use the mapplet in a mapping, data passes through the mapplet as part of the mapping data flow.

7. To pass data out of a mapplet, create mapplet output ports. Each port in an Output transformation connected to another transformation in the mapplet becomes a mapplet output port.

• Active mapplets with more than one Output transformation. You need one target in the mapping for each Output transformation in the mapplet. You cannot use only one data flow of the mapplet in a mapping.
• Passive mapplets with more than one Output transformation. Reduce to one Output transformation; otherwise you need one target in the mapping for each Output transformation in the mapplet. This means you cannot use only one data flow of the mapplet in a mapping.

Metadata Reporting and Sharing

Challenge

Using Informatica’s suite of metadata tools effectively in the design of the end-user analysis application.

Description

The levels of metadata available in the Informatica tool suite are quite extensive. The amount of metadata that is entered is dependent on the business requirements. Description information can be entered for all repository objects, sources, targets, transformations, etc. You also can drill down to the column level and give descriptions of the columns in a table if necessary. Also, all information about column size and scale, data types, and primary keys is stored in the repository.

The decision on how much metadata to create is often driven by project timelines. While it may be beneficial for a developer to enter detailed descriptions of each column, expression, variable, etc., it will also require a substantial amount of time to do so. Therefore, this decision should be made on the basis of how much metadata will be required by the systems that use the metadata.

Informatica offers two recommended ways for accessing the repository metadata.

• Effective with the release of version 5.0, Informatica PowerCenter contains a Metadata Reporter. The Metadata Reporter is a web-based application that allows you to run reports against the repository metadata.
• Because Informatica does not support or recommend direct reporting access to the repository, even for Select only queries, the second way of repository metadata reporting is through the use of views written using Metadata Exchange (MX). These views can be found in the Informatica Metadata Exchange (MX) Cookbook.

Metadata Reporter

The need for the Informatica Metadata Reporter arose from the number of clients requesting custom and complete metadata reports from their repositories. The Metadata Reporter allows report access to every Informatica object stored in the repository. The architecture of the Metadata Reporter is web-based, with an Internet browser front end. You can install the Metadata Reporter on a server running either UNIX or Windows that contains a supported web server. The Metadata Reporter contains servlets that must be installed on a web server that runs the Java Virtual Machine and supports the Java Servlet API. The currently supported web servers are:

• iPlanet 4.1 or higher
• Apache 1.3 with Jserv 1.1
• Jrun 2.3.3

(Note: The Metadata Reporter will not run directly on Microsoft IIS because IIS does not directly support servlets.)

The Metadata Reporter is accessible from any computer with a browser that has access to the web server where the Metadata Reporter is installed, even without the other Informatica Client tools being installed on that computer. The Metadata Reporter connects to your Informatica repository using JDBC drivers. Make sure the proper JDBC drivers are installed for your database platform. (Note: You can also use the JDBC to ODBC bridge to connect to the repository. Example syntax: jdbc:odbc:<data_source_name>)

Although the Repository Manager provides a number of Crystal Reports, the Metadata Reporter has several benefits:

• The Metadata Reporter is comprehensive. You can run reports on any repository. The reports provide information about all types of metadata objects.
• The Metadata Reporter is easily accessible. Because the Metadata Reporter is web-based, you can generate reports from any machine that has access to the web server where the Metadata Reporter is installed. You do not need direct access to the repository database, your sources or targets, or PowerMart or PowerCenter.
• The reports in the Metadata Reporter are customizable. The Metadata Reporter allows you to set parameters for the metadata objects to include in the report.
• The Metadata Reporter allows you to go easily from one report to another. The name of any metadata object that displays on a report links to an associated report. As you view a report, you can generate reports for objects on which you need more information.

The Metadata Reporter provides 15 standard reports that can be customized with the use of parameters and wildcards. The reports are as follows:

• Batch Report
• Executed Session Report
• Executed Session Report by Date
• Invalid Mappings Report
• Job Report
• Lookup Table Dependency Report
• Mapping Report
• Mapplet Report
• Object to Mapping/Mapplet Dependency Report
• Session Report
• Shortcut Report
• Source Schema Report
• Source to Target Dependency Report
• Target Schema Report
• Transformation Report

For a detailed description of how to run these reports, consult the Metadata Reporter Guide included in your PowerCenter Documentation.

Metadata Exchange: The Second Generation (MX2)

The MX architecture was intended primarily for Business Intelligence (BI) vendors who wanted to create a PowerCenter-based data warehouse and then display the warehouse metadata through their own products. The result was a set of relational views that encapsulated the underlying repository tables while exposing the metadata in several categories that were more suitable for external parties. Today, Informatica and several key vendors, including Brio, Business Objects, Cognos, and MicroStrategy, are effectively using the MX views to report and query the Informatica metadata.

Informatica currently supports the second generation of Metadata Exchange called MX2. Although the overall motivation for creating the second generation of MX remains consistent with the original intent, the requirements and objectives of MX2 supersede those of MX.

The primary requirements and features of MX2 are:

• Incorporation of object technology in a COM-based API. Although SQL provides a powerful mechanism for accessing and manipulating records of data in a relational paradigm, it’s not suitable for procedural programming tasks that can be achieved by C, C++, Java, or Visual Basic. Furthermore, the increasing popularity and use of object-oriented software tools require interfaces that can fully take advantage of the object technology. MX2 is implemented in C++ and offers an advanced object-based API for accessing and manipulating the PowerCenter Repository from various programming languages.
• Self-contained Software Development Kit (SDK). One of the key advantages of MX views is that they are part of the repository database and thus could be used independent of any of Informatica’s software products. The same requirement also holds for MX2, thus leading to the development of a self-contained API Software Development Kit that can be used independently of the client or server products.
• Extensive metadata content, especially multidimensional models for OLAP. A number of BI tools and upstream data warehouse modeling tools require complex multidimensional metadata, such as hierarchies, levels, and various relationships. This type of metadata was specifically designed and implemented in the repository to accommodate the needs of our partners by means of the new MX2 interfaces.
• Ability to write (push) metadata into the repository. Because of the limitations associated with relational views, MX could not be used for writing or updating metadata in the Informatica repository. As a result, such tasks could only be accomplished by directly manipulating the repository’s relational tables. The MX2 interfaces provide metadata write capabilities along with the appropriate verification and validation features to ensure the integrity of the metadata in the repository.
• Complete encapsulation of the underlying repository organization by means of an API. One of the main challenges with MX views and the interfaces that access the repository tables is that they are directly exposed to any schema changes of the underlying repository database. As a result, maintenance of the MX views and direct interfaces becomes a major undertaking with every major upgrade of the repository. MX2 alleviates this problem by offering a set of object-based APIs that are abstracted away from the details of the underlying relational tables, thus providing an easier mechanism for managing schema evolution.
• Integration with third-party tools. MX2 offers the object-based interfaces needed to develop more sophisticated procedural programs that can tightly integrate the repository with the third-party data warehouse modeling and query/reporting tools.
• Synchronization of metadata based on changes from up-stream and downstream tools. Given that metadata will reside in different databases and files in a distributed software environment, synchronizing changes and updates ensures the validity and integrity of the metadata. The object-based technology used in MX2 provides the infrastructure needed to implement automatic metadata synchronization and change propagation across different tools that access the Informatica Repository.
• Interoperability with other COM-based programs and repository interfaces. MX2 interfaces comply with Microsoft’s Component Object Model (COM) interoperability protocol. Therefore, any existing or future program that is COM-compliant can seamlessly interface with the Informatica Repository by means of MX2.
• Support for Microsoft’s UML-based Open Information Model (OIM). The Microsoft Repository and its OIM schema, based on the standard Unified Modeling Language (UML), could become a de facto general-purpose repository standard. Informatica has worked in close cooperation with Microsoft to ensure that the logical object model of MX2 remains consistent with the data warehousing components of the Microsoft Repository. This also facilitates robust metadata exchange with the Microsoft Repository and other software that support this repository.
• Framework to support a component-based repository in a multi-tier architecture. With the advent of the Internet and distributed computing, multi-tier architectures are becoming more widely accepted for accessing and managing metadata and data. The object-based technology of MX2 supports a multi-tier architecture so that a future Informatica Repository Server could be accessed from a variety of thin client programs running on different operating systems.

MX2 Architecture

MX2 provides a set of COM-based programming interfaces on top of the C++ object model used by the client tools to access and manipulate the underlying repository. This architecture not only encapsulates the physical repository structure, but also leverages the existing C++ object model to provide an open, extensible API based on the standard COM protocol. MX2 can be automatically installed on Windows 95, 98, or Windows NT using the install program provided with its SDK. After the successful installation of MX2, its interfaces are automatically registered and available to any software through standard COM programming techniques. The MX2 COM APIs support the PowerCenter XML Import/Export feature and provide a COM-based programming interface with which to import and export repository objects.

Naming Conventions

Challenge

Choosing a good naming standard for the repository and adhering to it.

Description

Repository Naming Conventions

Although naming conventions are important for all repository and database objects, the suggestions in this document focus on the former. Choosing a convention and sticking with it is the key point, and sometimes the most difficult part, in determining naming conventions. It is important to note that having a good naming convention will help facilitate a smooth migration and improve readability for anyone reviewing the processes.

FAQs

The following paragraphs present some of the questions that typically arise in naming repositories and suggest answers:

Q: What are the implications of numerous repositories or numerous folders within a repository, given that multiple development groups need to use the PowerCenter server, and each group works independently?

• One consideration for naming conventions is how to segregate different projects and data mart objects from one another. Whenever an object is shared between projects, the object should be stored in a shared work area so each of the individual projects can utilize a shortcut to the object. Mappings are listed in alphabetical order.

Q: What naming convention is recommended for Repository Folders?

• Something specific (e.g., Company_Department_Project-Name_Prod) is appropriate if multiple repositories are expected for various projects and/or departments.

Note that incorporating functions in the object name makes the name more descriptive at a higher level. The drawback is that when an object needs to be modified to incorporate some other business logic, the name no longer accurately describes the object. Use descriptive names cautiously and at a high enough level. It is not advisable to rename an object that is currently being used in a production environment.

The following tables illustrate some naming conventions for transformation objects (e.g., sources, targets, joiners, lookups, etc.) and repository objects (e.g., mappings, sessions, etc.).

Transformation Objects / Naming Convention

Advanced External Procedure Transform: aep_ProcedureName
Aggregator Transform: agg_TargetTableName(s) that leverages the expression and/or a name that describes the processing being done.
Expression Transform: exp_TargetTableName(s) that leverages the expression and/or a name that describes the processing being done.
External Procedure Transform: ext_ProcedureName
Filter Transform: fil_TargetTableName(s) that leverages the expression and/or a name that describes the processing being done.
Joiner Transform: jnr_SourceTable/FileName1_SourceTable/FileName2
Lookup Transform: lkp_LookupTableName
Mapplet: mplt_Description
Mapping Variable: $$Function or Process that is being done
Mapping Parameter: $$Function or Process that is being done
Normalizer Transform: nrm_TargetTableName(s) that leverages the expression and/or a name that describes the processing being done.
Rank Transform: rnk_TargetTableName(s) that leverages the expression and/or a name that describes the processing being done.
Router: rtr_TARGETTABLE that leverages the expression and/or a name that describes the processing being done. Group Name: Function_TargetTableName(s) (e.g., INSERT_EMPLOYEE or UPDATE_EMPLOYEE)
Sequence Generator: seq_Function
Source Qualifier Transform: sq_SourceTable1_SourceTable2
Stored Procedure: SpStoredProcedureName
Update Strategy: UpdTargetTableName(s) that leverages the expression and/or a name that describes the processing being done.

Repository Objects / Naming Convention

Mapping Name: m_TargetTable1_TargetTable2
Session Name: s_MappingName
Batch Names: bs_BatchName for a sequential batch and bc_BatchName for a concurrent batch.
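A convention like the one tabulated above is easier to keep when it is checked mechanically. The following is a minimal sketch, assuming object names have already been exported from the repository as (type, name) pairs by some other means; the dictionary keys, the function name follows_convention, and the sample names are hypothetical and not part of any Informatica API. Only the prefixes that end in an underscore in the tables are covered.

```python
# Prefixes copied from the naming tables; the keys are labels chosen for this sketch.
PREFIXES = {
    "advanced_external_procedure": "aep_",
    "aggregator": "agg_",
    "expression": "exp_",
    "external_procedure": "ext_",
    "filter": "fil_",
    "joiner": "jnr_",
    "lookup": "lkp_",
    "mapplet": "mplt_",
    "normalizer": "nrm_",
    "rank": "rnk_",
    "router": "rtr_",
    "sequence_generator": "seq_",
    "source_qualifier": "sq_",
    "mapping": "m_",
    "session": "s_",
}


def follows_convention(object_type, name):
    """Return True if the name starts with the expected prefix and has a body."""
    prefix = PREFIXES[object_type]
    return name.startswith(prefix) and len(name) > len(prefix)


# Sample (type, name) pairs invented for illustration
objects = [
    ("lookup", "lkp_CUSTOMER_DIM"),
    ("mapping", "m_CUSTOMER_DIM_ORDER_FACT"),
    ("session", "load_customers"),
    ("filter", "fil_"),
]

violations = [(t, n) for t, n in objects if not follows_convention(t, n)]
print(violations)
```

The check is deliberately limited to the prefix; whether the rest of the name describes the target table or the processing being done still needs a human reviewer.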

Folder Name

Folder names should logically group sessions and mappings. The grouping can be based on project, subject area, promotion group, or some combination of these.

Target Table Names

There are often several instances of the same target, usually because of different actions. When looking at a session run, there will be several instances, each with its own successful rows, failed rows, etc. To make observing a session run easier, targets should be named according to the action being executed on that target. For example, if a mapping has four instances of the CUSTOMER_DIM table according to update strategy (Update, Insert, Reject, Delete), the tables should be named as follows:

• CUSTOMER_DIM_UPD
• CUSTOMER_DIM_INS
• CUSTOMER_DIM_DEL
• CUSTOMER_DIM_REJ

Port Names

Port names should remain the same as the source unless some other action is performed on the port. In that case, the port should be prefixed with the appropriate name. When you bring a source port into a lookup or expression, the port should be prefixed with “IN_”. This will help the user immediately identify the input ports without having to line up the ports with the input checkbox. It is a good idea to prefix generated output ports. This helps trace the port value throughout the mapping as it may travel through many other transformations. For variables inside a transformation, you should use the prefix 'var_' plus a meaningful name.

Batch Names

Batch names follow basically the same rules as the session names. A prefix, such as 'b_', should be used, and there should be a suffix indicating if the batch is serial or concurrent.

Batch Session Postfixes

init_load: Initial Load indicates this session should only be used one time to load initial data to the targets.
incr_load: Incremental Load is an update of the target and is normally run periodically.
wkly: indicates a weekly run of this session / batches
mtly: indicates a monthly run of this session / batches

Shared Objects

Any object within a folder can be shared. These objects are sources, targets, mappings, transformations, and mapplets. To share objects in a folder, the folder must be designated as shared. Once the folder is shared, the users are allowed to create shortcuts to objects in the folder. If you have an object that you want to use in several mappings or across multiple folders, like an Expression transformation that calculates sales tax, you can place the object in a shared folder. You can then use the object in other folders by creating a shortcut to the object. In this case the naming convention is ‘SC_’, for instance SC_mltCREATION_SESSION, SC_DUAL.

ODBC Data Source Names

Set up all Open Database Connectivity (ODBC) data source names (DSNs) the same way on all client machines. PowerCenter uniquely identifies a source by its Database Data Source (DBDS) and its name. The DBDS is the same name as the ODBC DSN since the PowerCenter Client talks to all databases through ODBC.

If ODBC DSNs are different across multiple machines, there is a risk of analyzing the same table using different names. For example, machine1 has ODBC DSN Name0 that points to database1. TableA gets analyzed on machine1. TableA is uniquely identified as Name0.TableA in the repository. Machine2 has ODBC DSN Name1 that points to database1. TableA gets analyzed on machine2. TableA is uniquely identified as Name1.TableA in the repository. The result is that the repository may refer to the same object by multiple names, creating confusion for developers, testers, and potentially end users.

Also, refrain from using environment tokens in the ODBC DSN. For example, do not call it dev_db01. As you migrate objects from dev, to test, to prod, you are likely to wind up with source objects called dev_db01 in the production repository. ODBC database names should clearly describe the database they reference to ensure that users do not incorrectly point sessions to the wrong databases.

Database Connection Information

A good convention for database connection information is UserName_ConnectString. Be careful not to include machine names or environment tokens in the Database Connection Name. Database Connection names must be very generic to be understandable and enable a smooth migration.

Using a convention like User1_DW allows you to know who the session is logging in as and to what database. You should know which DW database you are working in, based on the repository environment. For example, if you are creating a session in your QA repository using connection User1_DW, the session will write to the QA DW database because you are in the QA repository.

Using this convention will allow for easier migration if you choose to use the Copy Folder method. When you use Copy Folder, session information is also copied. If the Database Connection information does not already exist in the folder you are copying to, it is also copied. So, if you use connections with names like Dev_DW in your development repository, they will eventually wind up in your QA, and even in your Production repository as you migrate folders. Manual intervention would then be necessary to change connection names, user names, passwords, and possibly even connect strings.

Instead, if you have a User1_DW connection in each of your three environments, when you copy a folder from Dev to QA, your sessions will automatically hook up to the connection that already exists in the QA repository. Now, your sessions are ready to go into the QA repository with no manual intervention required.
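The advice to keep environment tokens out of DSNs and connection names can be checked before a folder is migrated. The sketch below is one possible approach, not an Informatica feature: the token list ENV_TOKENS and the function name has_environment_token are assumptions, and the list would need to match whatever environment labels a site actually uses. Machine names are site-specific and are not detected here.

```python
import re

# Assumed environment labels; adjust to the labels used at a given site.
ENV_TOKENS = {"dev", "test", "qa", "prod"}


def has_environment_token(name):
    """Return True if any underscore- or punctuation-separated part is an environment label."""
    parts = [p for p in re.split(r"[_\W]+", name.lower()) if p]
    return any(p in ENV_TOKENS for p in parts)


# Names taken from the examples in the text
connections = ["User1_DW", "Dev_DW", "dev_db01"]
flagged = [c for c in connections if has_environment_token(c)]
print(flagged)
```

Splitting on separators rather than searching for substrings avoids flagging names that merely contain a token inside a longer word.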

Session and Data Partitioning

Challenge

Improving performance by identifying strategies for partitioning relational tables, XML, COBOL and standard flat files, and by coordinating the interaction between sessions, partitions, and CPUs. These strategies take advantage of the enhanced partitioning capabilities in PowerCenter 5.1.

Description

On hardware systems that are under-utilized, it may be possible to improve performance through parallel execution of the Informatica server engine. However, parallel execution may impair performance on over-utilized systems or systems with smaller I/O capacity.

Besides hardware, there are several other factors to consider when determining if a session is an ideal candidate for partitioning. These considerations include source and target database setup, target type, and mapping design. (The Designer client tool is used to implement session partitioning; see the Partitioning Rules and Validation section of the Designer Help.) When these factors have been considered and a partitioning strategy has been selected, the iterative process of adding partitions can begin. Continue adding partitions to the session until the desired performance threshold is met or degradation in performance is observed.

Follow these three steps when partitioning your session.

1. First, determine if you should partition your session. Parallel execution benefits systems that have the following characteristics:

• Under-utilized or intermittently used CPUs. To determine if this is the case, check the CPU usage of your machine:
  UNIX – type vmstat 1 10 on the command line. The column "id" displays the percentage of time the CPU was idle during the specified interval without any I/O wait. If there are CPU cycles available (twenty percent or more idle time) then this session's performance may be improved by adding a partition.
  NT – check the task manager performance tab.
• Sufficient I/O. To determine the I/O statistics:
  UNIX – type iostat on the command line. The column "%iowait" displays the percentage of CPU time spent idling while waiting for I/O requests. The column "%idle" displays the total percentage of the time that the CPU spends idling (i.e., the unused capacity of the CPU).
  NT – check the task manager performance tab.
• Sufficient memory. If too much memory is allocated to your session, you will receive a memory allocation error. Check to see that you're using as much memory as you can. If the session is paging, increase the memory. To determine if the session is paging, follow these steps:
  UNIX – type vmstat 1 10 on the command line. "pi" displays the number of pages swapped in from the page space during the specified interval. "po" displays the number of pages swapped out to the page space during the specified interval. If these values indicate that paging is occurring, it may be necessary to allocate more memory, if possible.
  NT – check the task manager performance tab.

2. The next step is to set up the partition. The following are selected hints for session setup; see the Session and Server Guide for further directions on setting up partitioned sessions.

• Add one partition at a time. To best monitor performance, add one partition at a time, and note your session settings before you add each partition.
• Set DTM Buffer Memory. For a session with n partitions, this value should be at least n times the original value for the non-partitioned session.
• Set cached values for Sequence Generator. For a session with n partitions, there should be no need to use the "Number of Cached Values" property of the sequence generator. If you must set this value to a value greater than zero, make sure it is at least n times the original value for the non-partitioned session.
• Partition the source data evenly. The source data should be partitioned into equal sized chunks for each partition.
• Partition tables. A notable increase in performance can also be realized when the actual source and target tables are partitioned. Work with the DBA to discuss the partitioning of source and target tables, and the setup of tablespaces.
• Consider Using External Loader. As with any session, using an external loader may increase session performance. You can only use Oracle external loaders for partitioning. Refer to the Session and Server Guide for more information on using and setting up the Oracle external loader for partitioning.

3. The third step is to monitor the session to see if the partition is degrading or improving session performance. If the session performance is improved and the session meets the requirements of step 1, add another partition.

• Write throughput. Check the session statistics to see if you have increased the write throughput.
• Paging. Check to see if the session is now causing the system to page. When you partition a session and there are cached lookups, you must make sure that DTM memory is increased to handle the lookup caches. When you partition a source that uses a static lookup cache, the Informatica Server creates one memory cache for each partition and one disk cache for each transformation. Therefore, the memory requirements will grow for each partition. If the memory is not increased, the system may start paging to disk, causing degradation in performance.

Assumptions

The following assumptions pertain to the source and target systems of a session that is a candidate for partitioning. These conditions can help to maximize the benefits that can be achieved through partitioning.

• Indexing has been implemented on the partition key when using a relational source.
• Source files are located on the same physical machine as the PMServer process when partitioning flat files, COBOL and XML, to reduce network overhead and delay.
• All possible constraints are dropped or disabled on relational targets.
• All possible indexes are dropped or disabled on relational targets.
• Table Spaces and Database Partitions are properly managed on the target system.
• Target files are written to the same physical machine that hosts the PMServer process, in order to reduce network overhead and delay.
• Oracle External Loaders are utilized whenever possible (Parallel Mode).
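The sizing rules from step 2 and the twenty-percent idle guideline from step 1 can be written down as a small planning aid. The following Python sketch is illustrative only: the SessionSettings class and both function names are hypothetical and are not part of PowerCenter, and the numbers would come from your own session properties and vmstat output.

```python
from dataclasses import dataclass


@dataclass
class SessionSettings:
    """Hypothetical container for two settings of a non-partitioned session."""
    dtm_buffer_bytes: int        # DTM buffer memory of the original session
    sequence_cached_values: int  # "Number of Cached Values"; 0 means not used


def minimum_settings_for_partitions(base: SessionSettings, n: int) -> SessionSettings:
    """Return the minimum values for a session with n partitions.

    Both values are scaled to n times the original value, which is the
    lower bound given in the session setup hints.
    """
    if n < 1:
        raise ValueError("partition count must be at least 1")
    return SessionSettings(
        dtm_buffer_bytes=base.dtm_buffer_bytes * n,
        sequence_cached_values=base.sequence_cached_values * n,
    )


def may_add_partition(cpu_idle_percent: float, paging: bool) -> bool:
    """Apply the guideline: 20 percent or more idle CPU and no paging."""
    return cpu_idle_percent >= 20.0 and not paging


if __name__ == "__main__":
    original = SessionSettings(dtm_buffer_bytes=12_000_000, sequence_cached_values=0)
    print(minimum_settings_for_partitions(original, 3))
    print(may_add_partition(cpu_idle_percent=35.0, paging=False))
```

A helper like this only records the arithmetic; the decision to keep or remove a partition still depends on the monitoring described in step 3.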

Using Parameters, Variables and Parameter Files

Challenge

Understanding how parameters, variables, and parameter files work and using them for maximum efficiency.

Description

Prior to the release of PowerCenter 5.x, the only variables inherent to the product were defined to specific transformations and to those Server variables that were global in nature. Transformation variables were defined as variable ports in a transformation and could only be used in that specific Transformation object (e.g., Expression, Aggregator and Rank Transformations). Similarly, global parameters defined within Server Manager would affect the subdirectories for Source Files, Target Files, Log Files, etc.

PowerCenter 5.x has made variables and parameters available across the entire mapping rather than for a specific transformation object. In addition, it provides built-in parameters for use within Server Manager. Using parameter files, these values can change from session-run to session-run.

Mapping Variables

You declare mapping variables in PowerCenter Designer using the menu option Mappings -> Parameters and Variables. After mapping variables are selected, you use the pop-up window to create a variable by specifying its name, data type, initial value, aggregation type, precision and scale. This is similar to creating a port in most transformations.

Variables, by definition, are objects that can change value dynamically. Informatica added four functions to effect change to mapping variables:

• SetVariable
• SetMaxVariable
• SetMinVariable
• SetCountVariable

A mapping variable can store the last value from a session run in the repository to be used as the starting value for the next session run.

Name

The name of the variable should be descriptive and be preceded by '$$' (so that it is easily identifiable as a variable). A typical variable name is: $$Procedure_Start_Date.

Aggregation Type

This entry creates specific functionality for the variable and determines how it stores data. For example, with an aggregation type of Max, the value stored in the repository would be the max value across ALL session runs until the value is deleted.

Initial Value

This value is used during the first session run when there is no corresponding and overriding parameter file. This value is also used if the stored repository value is deleted. If no initial value is identified, then a data type specific default value is used.

Variable values are not stored in the repository when the session:

• Fails to complete.
• Is configured for a test load.
• Is a debug session.
• Runs in debug mode and is configured to discard session output.

Order of Evaluation

The start value is the value of the variable at the start of the session. The start value can be a value defined in the parameter file for the variable, a value saved in the repository from the previous run of the session, a user-defined initial value for the variable, or the default value based on the variable data type. The PowerCenter Server looks for the start value in the following order:

1. Value in session parameter file
2. Value saved in the repository
3. Initial value
4. Default value

Mapping Parameters and Variables

Since parameter values do not change over the course of the session run, the value used is based on:

• Value in session parameter file
• Initial value
• Default value
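The two lookup orders above differ only in whether the repository is consulted. As a minimal sketch of that precedence, assuming that None stands for "no value supplied", the following Python function (a hypothetical name, not a PowerCenter API) resolves the start value for either a variable or a parameter.

```python
from typing import Any, Optional


def resolve_start_value(
    default_value: Any,
    parameter_file_value: Optional[Any] = None,
    repository_value: Optional[Any] = None,
    initial_value: Optional[Any] = None,
    is_variable: bool = True,
) -> Any:
    """Return the first value found, following the documented order.

    Variables:  parameter file, repository, initial value, default value.
    Parameters: parameter file, initial value, default value
                (parameters do not change during a run, so no value is
                read back from the repository).
    """
    candidates = [parameter_file_value]
    if is_variable:
        candidates.append(repository_value)
    candidates.append(initial_value)
    for value in candidates:
        if value is not None:
            return value
    return default_value


if __name__ == "__main__":
    # A variable with a saved repository value and no parameter file entry.
    print(resolve_start_value(
        default_value="01/01/1753",       # illustrative default only
        repository_value="02/03/1998",
        initial_value="01/01/1900",
    ))
```

The default passed in the usage example is a placeholder; the actual default depends on the data type of the variable.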

Once defined, mapping parameters and variables can be used in the Expression Editor section of the following transformations:

• Expression
• Filter
• Router
• Update Strategy

Mapping parameters and variables also can be used within the Source Qualifier in the SQL query, user-defined join, and source filter sections.

Parameter Files

Parameter files can be used to override values of mapping variables or mapping parameters, or to define Server-specific values for a session run. Parameter files have a very simple and defined format; they are divided into session-specific sections, with each section defined within brackets as FOLDER.SESSION_NAME. The naming is case sensitive. Parameters or variables must be defined in the mapping to be used. A line can be 'REMed' out by placing a semicolon at the beginning. Parameter files do not globally assign values.

Some parameter file examples:

    [USER1.s_m_subscriberstatus_load]
    $$Post_Date_Var=10/04/2001

    [USER1.s_test_var1]
    $$PMSuccessEmailUser=XXX@informatica.com
    ;$$Help_User

A parameter file is declared for use by a session, either within the session properties, at the outer-most batch a session resides in, or as a parameter value when utilizing PMCMD command.

The following parameters and variables can be defined or overridden within the parameter file:

    Parameter & Variable Type                 Parameter & Variable Name   Desired Definition
    String Mapping Parameter                  $$State                     MA
    Datetime Mapping Variable                 $$Time                      10/1/2000 00:00:00
    Source File (Session Parameter)           $InputFile1                 Sales.txt
    Database Connection (Session Parameter)   $DBConnection_Target        Sales (database connection)
    Session Log File (Session Parameter)      $PMSessionLogFile           d:/session logs/firstrun.txt

Parameters and variables cannot be used in the following:

• Lookup SQL Override.
• Lookup Location (Connection String).
• Schema/Owner names within Target Objects/Session Properties.

Example: Variables and Parameters in an Incremental Strategy

Variables and parameters can enhance incremental strategies. The following example uses a mapping variable, an expression transformation object, and a parameter file for restarting.

Scenario

Company X wants to start with an initial load of all data but wants subsequent process runs to select only new information. The environment data has an inherent Post_Date that is defined within a column named Date_Entered that can be used. Process will run once every twenty-four hours.

Sample Solution

Create a mapping with source and target objects. From the menu create a new mapping variable named $$Post_Date with the following attributes:

• TYPE – Variable
• DATATYPE – Date/Time
• AGGREGATION TYPE – MAX
• INITIAL VALUE – 01/01/1900

Note that there is no need to encapsulate the INITIAL VALUE with quotation marks. However, if this value is used within the Source Qualifier SQL, it is necessary to use the native RDBMS function to convert (e.g., TO_DATE(--,--)). Within the Source Qualifier Transformation, use the following in the Source_Filter Attribute:

    DATE_ENTERED > TO_DATE('$$Post_Date','MM/DD/YYYY HH24:MI:SS')

Also note that the initial value 01/01/1900 will be expanded by the PowerCenter Server to 01/01/1900 00:00:00, hence the need to convert the parameter to a date time.

The next step is to pass $$Post_Date and Date_Entered to an Expression transformation. This is where the function for setting the variable will reside. An output port named Post_Date is created with data type of date/time. In the expression code section place the following function:

    SETMAXVARIABLE($$Post_Date, DATE_ENTERED)

The function evaluates each value for DATE_ENTERED and updates the variable with the Max value to be passed forward. For example:

    DATE_ENTERED    Resultant POST_DATE
    9/1/2000        9/1/2000
    10/30/2001      10/30/2001
    9/2/2000        10/30/2001

Consider the following with regard to the functionality:

1. In order for the function to assign a value and ultimately store it in the repository, the port must be connected to a downstream object. It need not go to the target, but it must go to another Expression Transformation. The reason is that the memory will not be instantiated unless it is used in a downstream transformation object.
2. In order for the function to work correctly, the rows have to be marked for insert. If the mapping is an update only mapping (i.e., Treat Rows As is set to Update in the session properties) the function will not work. In this case, make the session Data Driven and add an Update Strategy after the transformation containing the SETMAXVARIABLE function, but before the Target.
3. If the intent is to store the original Date_Entered per row and not the evaluated date value, then add an ORDER BY clause to the Source Qualifier. That way the dates are processed and set in order and data is preserved.

The first time this mapping is run the SQL will select from the source where Date_Entered is > 01/01/1900, providing an initial load. As data flows through the mapping, the variable gets updated to the Max Date_Entered it encounters. Upon successful completion of the session, the variable is updated in the Repository for use in the next session run. To view the current value for a particular variable associated with the session, right-click on the session and choose View Persistent Values.

The following graphic shows that after the initial run, the Max Date_Entered was 02/03/1998. The next time this session is run, based on the variable in the Source Qualifier Filter, only source rows where Date_Entered > 02/03/1998 will be processed.
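The interaction between the source filter and the running maximum in this example can be simulated outside PowerCenter. The Python sketch below is an assumption-laden model, not the product's implementation: run_incremental is a hypothetical function that applies the Date_Entered > $$Post_Date filter and then tracks the maximum row by row, as a variable with the MAX aggregation type would.

```python
from datetime import datetime
from typing import List, Tuple


def run_incremental(
    date_entered_rows: List[datetime], post_date: datetime
) -> Tuple[List[datetime], datetime]:
    """Simulate one session run of the incremental strategy.

    Returns the per-row resultant value of the variable and the final
    value that would be saved for the next run after a successful session.
    """
    # Source filter: only rows newer than the current start value.
    selected = [d for d in date_entered_rows if d > post_date]
    resultant = []
    for d in selected:
        # MAX aggregation: keep the larger of the variable and the row value.
        post_date = max(post_date, d)
        resultant.append(post_date)
    return resultant, post_date


if __name__ == "__main__":
    rows = [datetime(2000, 9, 1), datetime(2001, 10, 30), datetime(2000, 9, 2)]
    per_row, saved = run_incremental(rows, datetime(1900, 1, 1))
    for entered, result in zip(rows, per_row):
        print(entered.date(), "->", result.date())
    # A second run over the same rows selects nothing new.
    print(run_incremental(rows, saved)[0])
```

Run against the three sample dates, the per-row values reproduce the Resultant POST_DATE column of the table in this example, which also shows why unordered input does not preserve the original Date_Entered per row.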

Resetting or Overriding Persistent Values

To reset the persistent value to the initial value declared in the mapping, view the persistent value from Server Manager (see graphic above) and press Delete Values. This will delete the stored value from the Repository, causing the Order of Evaluation to use the Initial Value declared from the mapping.

If a session run is needed for a specific date, use a parameter file. There are two basic ways to accomplish this:

• Create a generic parameter file, place it on the server, and point all sessions to that parameter file. A session may (or may not) have a variable, and the parameter file need not have variables and parameters defined for every session 'using' the parameter file. To override the variable, either change, uncomment or delete the variable in the parameter file.
• Run PMCMD for that session but declare the specific parameter file within the PMCMD command.

Parameter files can be declared in Session Properties under the Log & Error Handling Tab.

In this example, after the initial session is run the parameter file contents may look like:

    [Test.s_Incremental]
    ;$$Post_Date=

By using the semicolon, the variable override is ignored and the Initial Value or Stored Value is used. If, in the subsequent run, the data processing date needs to be set to a specific date (for example: 04/21/2001), then a simple Perl Script can update the parameter file to:

    [Test.s_Incremental]
    $$Post_Date=04/21/2001

Upon running the sessions, the order of evaluation looks to the parameter file first, sees a valid variable and value and uses that value for the session run. After successful completion, run another script to reset the parameter file.

Example: Using Session and Mapping Parameters in Multiple Database Environments

Reusable mappings that can source a common table definition across multiple databases, regardless of differing environmental definitions (e.g., instances, schemas, user/logins), are required in a multiple database environment.

Scenario

Company X maintains five Oracle database instances. All instances have a common table definition for sales orders, but each instance has a unique instance name, schema and login.

    DB Instance   Schema     Table        User    Password
    ORC1          aardso     orders       Sam     max
    ORC99         environ    orders       Help    me
    HALC          hitme      order_done   Hi      Lois
    UGLY          snakepit   orders       Punch   Judy
    GORF          gmer       orders       Brer    Rabbit

Each sales order table has a different name, but the same definition:

    ORDER_ID         NUMBER (28)    NOT NULL,
    DATE_ENTERED     DATE           NOT NULL,
    DATE_PROMISED    DATE           NOT NULL,
    DATE_SHIPPED     DATE           NOT NULL,
    EMPLOYEE_ID      NUMBER (28)    NOT NULL,
    CUSTOMER_ID      NUMBER (28)    NOT NULL,
    SALES_TAX_RATE   NUMBER (5,4)   NOT NULL,
    STORE_ID         NUMBER (28)    NOT NULL

Sample Solution

Using Server Manager, create multiple connection strings. In this example, the strings are named according to the DB Instance name. Using Designer, create the mapping that sources the commonly defined table. Then create a Mapping Parameter named $$Source_Schema_Table with the following attributes:

Note that the parameter attributes vary based on the specific environment. Also, the initial value is not required as this solution will use parameter files. Open the source qualifier and use the mapping parameter in the SQL Override as shown in the following graphic.

Parmfile1. Using Server Manager.s_Incremental_SOURCE_CHANGES] $$Source_Schema_Table=aardso.orders $DBConnection_Source= ORC1 Parmfile2.Open the Expression Editor and select Generate SQL. there will be five separate parameter files.s_Incremental_SOURCE_CHANGES] PAGE BP-84 BEST PRACTICES INFORMATICA CONFIDENTIAL . In this example. Override the table names in the SQL statement with the mapping parameter. Now create the parameter file. create a session based on this mapping.txt [Test. Within the Source Database connection. The generated SQL statement will show the columns.txt [Test. drop down place the following parameter: $DBConnection_SourcePoint the target to the corresponding target and finish.

txt [Test.s_Incremental_SOURCE_CHANGES] $$Source_Schema_Table=snakepit.orders $DBConnection_Source= UGLY Parmfile5.1:4001 Test: s_Incremental_SOURCE_CHANGES:pf=’\$PMRootDir\ParmFiles\Parmfile1.s_Incremental_SOURCE_CHANGES] $$Source_Schema_Table=hitme.orders $DBConnection_Source= GORF Use PMCMD to run the five sessions in parallel.0.txt ‘ 1 1 INFORMATICA CONFIDENTIAL BEST PRACTICES PAGE BP-85 .txt [Test.txt [Test.orders $DBConnection_Source= ORC99 Parmfile3. The syntax for PMCMD for starting sessions is as follows: pmcmd start {user_name | %user_env_var} {password | %password_env_var} {[TCP/IP:][hostname:]portno | IPX/SPX:ipx/spx_address} [folder_name:]{session_name | batch_name}[:pf=param_file] session_flag wait_flag In this environment there would be five separate commands: pmcmd start tech_user pwd 127.0.s_Incremental_SOURCE_CHANGES] $$Source_Schema_Table= gmer.order_done $DBConnection_Source= HALC Parmfile4.$$Source_Schema_Table=environ.

txt ‘ 1 1 pmcmd start tech_user pwd 127.1:4001 Test: s_Incremental_SOURCE_CHANGES:pf=’\$PMRootDir\ParmFiles\ Parmfile4.0.1:4001 Test: s_Incremental_SOURCE_CHANGES:pf=’\$PMRootDir\ParmFiles\ Parmfile2.0.1:4001 Test: s_Incremental_SOURCE_CHANGES:pf=’\$PMRootDir\ParmFiles\ Parmfile5. a pre.txt ‘ 1 1 pmcmd start tech_user pwd 127.0.txt ‘ 1 1 pmcmd start tech_user pwd 127.pmcmd start tech_user pwd 127.0. you could run the sessions in sequence with one parameter file. In this case.0.1:4001 Test: s_Incremental_SOURCE_CHANGES:pf=’\$PMRootDir\ParmFiles\ Parmfile3.or post-session script would change the parameter file for the next session.0. PAGE BP-86 BEST PRACTICES INFORMATICA CONFIDENTIAL .txt ‘ 1 1 Alternatively.0.0.
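Maintaining five near-identical parameter files by hand invites typing errors, so it can help to generate them from the instance list and to read them back for checking. The Python sketch below assumes only the file format described in this section (a [FOLDER.SESSION_NAME] header, NAME=value lines, and a leading semicolon to comment a line out); the function names are hypothetical and nothing here calls PowerCenter or pmcmd.

```python
from typing import Dict, List, Tuple

# (connection name, schema, table) taken from the scenario's instance list.
INSTANCES: List[Tuple[str, str, str]] = [
    ("ORC1", "aardso", "orders"),
    ("ORC99", "environ", "orders"),
    ("HALC", "hitme", "order_done"),
    ("UGLY", "snakepit", "orders"),
    ("GORF", "gmer", "orders"),
]


def render_parameter_file(folder: str, session: str, schema: str,
                          table: str, connection: str) -> str:
    """Build the text of one parameter file for one database instance."""
    return (
        f"[{folder}.{session}]\n"
        f"$$Source_Schema_Table={schema}.{table}\n"
        f"$DBConnection_Source={connection}\n"
    )


def parse_parameter_file(text: str) -> Dict[str, Dict[str, str]]:
    """Parse parameter file text into {section: {name: value}}.

    Lines starting with a semicolon are treated as commented out.
    """
    sections: Dict[str, Dict[str, str]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";"):
            continue
        if line.startswith("[") and line.endswith("]"):
            current = sections.setdefault(line[1:-1], {})
        elif "=" in line and current is not None:
            name, value = line.split("=", 1)
            current[name.strip()] = value.strip()
    return sections


if __name__ == "__main__":
    for number, (connection, schema, table) in enumerate(INSTANCES, start=1):
        text = render_parameter_file(
            "Test", "s_Incremental_SOURCE_CHANGES", schema, table, connection)
        print(f"Parmfile{number}.txt")
        print(text)
```

The same parser can confirm that a reset file, in which the variable line is commented out with a semicolon, yields no override for the session.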

A Mapping Approach to Trapping Data Errors

Challenge

Addressing data content errors within mappings to facilitate re-routing erroneous rows to a target other than the original target table.

Description

Identifying errors and creating an error handling strategy is an essential part of a data warehousing project. In the production environment, data must be checked and validated prior to entry into the data warehouse. One strategy for handling errors is to maintain database constraints. Another approach is to use mappings to trap data errors.

The first step in using mappings to trap errors is understanding and identifying the error handling requirement. The following questions should be considered:

• What types of errors are likely to be encountered?
• Of these errors, which ones should be captured?
• What process can capture the possible errors?
• Should errors be captured before they have a chance to be written to the target database?
• Should bad files be used?
• Will any of these errors need to be reloaded or corrected?
• How will the users know if errors are encountered?
• How will the errors be stored?
• Should descriptions be assigned for individual errors?
• Can a table be designed to store captured errors and the error descriptions?

Capturing data errors within a mapping and re-routing these errors to an error table allows for easy analysis for the end users and improves performance. For example, suppose it is necessary to identify foreign key constraint errors within a mapping. This can be accomplished by creating a lookup into a dimension table prior to loading the fact table. Referential integrity is assured by including this functionality in a mapping. The database still enforces the foreign key constraints, but erroneous data will not be written to the target table. Also, if constraint errors are captured within the mapping, the PowerCenter server will not have to write the error to the session log and the reject/bad file.

Data content errors also can be captured in a mapping. Mapping logic can identify data content errors and attach descriptions to the errors. This approach can be effective for many types of data content errors, including: date conversion, null values intended for not null target fields, and incorrect data formats or data types.

Error Handling Example

In the following example, we want to capture null values before they enter into a target field that does not allow nulls.

After we've identified the type of error, the next step is to separate the error from the data flow. Use the Router Transformation to create a stream of data that will be the error route. Any row containing an error (or errors) will be separated from the valid data and uniquely identified with a composite key consisting of a MAPPING_ID and a ROW_ID. The MAPPING_ID refers to the mapping name and the ROW_ID is generated by a Sequence Generator. The composite key allows developers to trace rows written to the error tables.

Error tables are important to an error handling strategy because they store the information useful to error identification and troubleshooting. In this example, the two error tables are ERR_DESC_TBL and TARGET_NAME_ERR.

The ERR_DESC_TBL table will hold information about the error, such as the mapping name, the ROW_ID, and a description of the error. This table is designed to hold all error descriptions for all mappings within the repository for reporting purposes.

The TARGET_NAME_ERR table will be an exact replica of the target table with two additional columns: ROW_ID and MAPPING_ID. These two columns allow the TARGET_NAME_ERR and the ERR_DESC_TBL to be linked. The TARGET_NAME_ERR table provides the user with the entire row that was rejected, enabling the user to trace the error rows back to the source. These two tables might look like the following:

The error handling functionality must assign a unique description to each error in the rejected row. In this example, any null value intended for a not null target field will generate an error message such as 'Column1 is NULL' or 'Column2 is NULL'. This step can be done in an Expression Transformation.

After field descriptions are assigned, we need to break the error row into several rows, with each containing the same content except for a different error description. You can use the Normalizer Transformation to break one row of data into many rows. After a single row of data is separated based on the number of possible errors in it, we need to filter the columns within the row that are actually errors. For example, one row of data may have as many as three errors, but in this case, the row actually has only one error so we need to write only one error with its description to the ERR_DESC_TBL.

When the row is written to the ERR_DESC_TBL, we can link this row to the row in the TARGET_NAME_ERR table using the ROW_ID and the MAPPING_ID. The following chart shows how the two error tables can be linked. Focus on the bold selections in both tables.

TARGET_NAME_ERR

    Column1   Column2   Column3   ROW_ID   MAPPING_ID
    NULL      NULL      NULL      1        DIM_LOAD

ERR_DESC_TBL

    FOLDER_NAME   MAPPING_ID   ROW_ID   ERROR_DESC         LOAD_DATE   SOURCE   Target
    CUST          DIM_LOAD     1        Column 1 is NULL   SYSDATE     DIM      FACT
    CUST          DIM_LOAD     1        Column 2 is NULL   SYSDATE     DIM      FACT
    CUST          DIM_LOAD     1        Column 3 is NULL   SYSDATE     DIM      FACT

The solution example would look like the following in a mapping:

The mapping approach is effective because it takes advantage of reusable objects, thereby using the same logic repeatedly within a mapplet. This makes error detection easy to implement and manage in a variety of mappings.

By adding another layer of complexity within the mappings, errors can be flagged as 'soft' or 'hard'. A 'hard' error can be defined as one that would fail when being written to the database, such as a constraint error. A 'soft' error can be defined as a data content error. A record flagged as a hard error is written to the error route, while a record flagged as a soft error can be written to the target system and the error tables. This gives business analysts an opportunity to evaluate and correct data imperfections while still allowing the records to be processed for end-user reporting.

Ultimately, business organizations need to decide if the analysts should fix the data in the reject table or in the source systems. The advantage of the mapping approach is that all errors are identified as either data errors or constraint errors and can be properly addressed. The mapping approach also reports errors based on projects or categories by identifying the mappings that contain errors. The most important aspect of the mapping approach, however, is its flexibility. Once an error type is identified, the error handling logic can be placed anywhere within a mapping. By using the mapping approach to capture identified errors, data warehouse operators can effectively communicate data quality issues to the business users.
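The row-splitting logic of the mapping approach can be prototyped in a few lines before it is built with Router, Expression and Normalizer transformations. The following Python sketch is a hypothetical illustration, not a PowerCenter construct: trap_null_errors takes one source row, and for every null found in a not-null column it produces one description record keyed by MAPPING_ID and ROW_ID, plus a copy of the rejected row carrying the same key.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Row = Dict[str, Optional[str]]


def trap_null_errors(
    row: Row,
    not_null_columns: Sequence[str],
    mapping_id: str,
    row_id: int,
    folder_name: str,
    load_date: str,
) -> Tuple[Optional[Row], List[Dict[str, object]]]:
    """Return (error_row, descriptions) for a single input row.

    error_row is None and descriptions is empty when the row is valid.
    Otherwise error_row is the full rejected row plus ROW_ID and MAPPING_ID,
    and descriptions holds one record per null column.
    """
    null_columns = [c for c in not_null_columns if row.get(c) is None]
    if not null_columns:
        return None, []
    error_row: Row = dict(row)
    error_row["ROW_ID"] = str(row_id)
    error_row["MAPPING_ID"] = mapping_id
    descriptions = [
        {
            "FOLDER_NAME": folder_name,
            "MAPPING_ID": mapping_id,
            "ROW_ID": row_id,
            "ERROR_DESC": f"{column} is NULL",
            "LOAD_DATE": load_date,
        }
        for column in null_columns
    ]
    return error_row, descriptions


if __name__ == "__main__":
    rejected, errors = trap_null_errors(
        {"Column1": None, "Column2": "A", "Column3": None},
        ["Column1", "Column2", "Column3"],
        mapping_id="DIM_LOAD", row_id=1, folder_name="CUST",
        load_date="2001-04-21",
    )
    print(rejected)
    for record in errors:
        print(record)
```

Because only the columns that are actually null generate a description, the sketch covers both the break-into-rows step and the subsequent filtering step in one pass.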

Design Error Handling Infrastructure

Challenge

Understanding the need for an error handling strategy, identifying potential errors, and determining an optimal plan for error handling.

Description

It is important to realize the need for an error handling strategy, then devise an infrastructure to resolve the errors. Although error handling varies from project to project, the typical requirement of an error handling system is to address data quality issues (i.e., dirty data). Implementing an error handling strategy requires a significant amount of planning and understanding of the load process. You should prepare a high level data flow design to illustrate the load process and the role that error handling plays in it.

Error handling is an integral part of any load process and directly affects the process when it starts and stops. An error handling strategy should be capable of accounting for unrecoverable errors during the load process and provide crash recovery, stop, and restart capabilities. Stop and restart processes can be managed through the pre- and post-session shell scripts for each PowerCenter session.

Although source systems vary widely in functionality and data quality standards, at some point a record with incorrect data will be introduced into the data warehouse from a source system. The error handling strategy should reject these rows, provide a place to put the rejected rows, and set a limit on how many errors can occur before the load process stops. It also should report on the rows that are rejected by the load process, and provide a mechanism for reload.

Regardless of whether an error requires manual inspection, correction of data or a rerun of the process, the owner needs to know if any rows were loaded or changed during the load, especially if a response is critical to the continuation of the process. Therefore, it is critical to have a notification process in place. PowerCenter includes a post-session e-mail functionality that can trigger the delivery of e-mail. Post-session scripts can be written to increase the functionality of the notification process to send detailed messages upon receipt of an error or file.

The following table presents examples of one company's error conditions and the associated notification actions:

Error Condition: Arrival of .DAT and .SENT files. Timer to check if the files have arrived by 3:00 AM for daily loads and 2:00 PM Saturday for weekly loads.
Notification Action: If the .DAT files or .SENT file do not arrive by 3:00 AM, send an e-mail notification to Production Support and on-call resource.

Error Condition: Tablespace check and Database constraints check for creating Target Tables.
Notification Action: If the required Tablespace is not available, the system load for all the loads that are part of the system are aborted, and notification is sent to the DBA and Production Support.

Error Condition: Timer to check if the load has completed by 5:00 AM.
Notification Action: If the load has not completed within the 2-hour window, by 5:00 AM, then send an e-mail notification to Product Support.

Error Condition: The rejected record number crosses the error threshold limit OR Informatica PowerCenter session fails for any other reason.
Notification Action: Load the rejected records to a reject file and send an e-mail notification to Production Support.

Error Condition: Match the Hash Total and the Column Totals loaded in the target tables with the contents of the .SENT file.
Notification Action: If the Hash total and the total number of records do not match, rollback the data load and send notification to Production Support. If they do not match, do a rollback of the records loaded in the target.

Each notification is delivered by 1) E-mail and, in most cases, also by 2) Page.

Infrastructure Overview

A better way of identifying and trapping errors is to create tables within the mapping to hold the rows that contain errors.

A Sample Scenario:

Each target table should have an identical error table, named <TARGET_TABLE_NAME>_RELOAD with two additional columns, MAPPING_NAME and SEQ_ID. An additional error table, ENTERPRISE_ERR_TBL captures descriptions for all errors committed during loading. The two tables look like the following:

The <TARGET_TABLE_NAME>_RELOAD table is target specific. The ENTERPRISE_ERR_TBL is a target table in each mapping that requires error capturing. The entire process of defining the error handling strategy within a particular mapping depends on the type of errors that you expect to capture. The following examples illustrate what is necessary for successful error handling.

<TARGET_TABLE_NAME>_RELOAD

    Fields:   LKP1   LKP2   LKP3   ASOF_DT    SEQ_ID   MAPPING_NAME
    Values:   test   OCC    VAL    12/21/00   1        DIM_LOAD

ENTERPRISE_ERR_TBL

    FOLDER_NAME   MAPPING_NAME   SEQ_ID   ERROR_DESC     LOAD_DATE   SOURCE   Target   LKP_TBL
    Project_1     DIM_LOAD       1        LKP1 Invalid   SYSDATE     DIM      DIM      SAL
    Project_1     DIM_LOAD       1        LKP2 Invalid   SYSDATE     DIM      DIM      CUST
    Project_1     DIM_LOAD       1        LKP3 Invalid   SYSDATE     DIM      DIM      DEPT

The TARGET(<TARGET_TABLE_NAME>)_RELOAD captures rows of data that failed the validation tests. By looking at the data rows stored in ENTERPRISE_ERR_TBL, we can identify that mapping DIM_LOAD with the SEQ_ID of 1 had 3 errors. Since rows in TARGET_RELOAD have a unique SEQ_ID, we can determine that the row of data in the TARGET_RELOAD table with the SEQ_ID of 1 had three errors. By looking at the first row in the ENTERPRISE_ERR_TBL, the error description states that 'LKP1 was Invalid'. By using the MAPPING_NAME and SEQ_ID, we can know that ('test') is the failed value in LKP1. Thus, we can determine which values failed the lookup.

Documenting Mappings Using Repository Reports

Challenge

Documenting and reporting comments contained in each of the mapping objects.

Description

It is crucial to take advantage of the metadata contained in the repository to document your Informatica mappings, but the Informatica mappings must be properly documented to take full advantage of this metadata. This means that comments must be included at all levels of a mapping, from the mapping itself down to the objects and ports within the mapping. With PowerCenter, you can enter description information for all repository objects (sources, targets, transformations, etc.), but the amount of metadata that you enter should be determined by the business requirements. You can also drill down to the column level and give descriptions of the columns in a table if necessary. All information about column size and scale, datatypes, and primary keys is stored in the repository.

Once the mappings and sessions contain the proper metadata, it is important to develop a plan for extracting this metadata. PowerCenter provides several ways to access the metadata contained within the repository. One way of doing this is through the generic Crystal Reports that are supplied with PowerCenter. These reports are accessible through the Repository Manager. (Open the Repository Manager, and click Reports.) You can choose from the following four reports:

• Mapping report (map.rpt). Lists source column and transformation details for each mapping in each folder or repository.
• Source and target dependencies report (S2t_dep.rpt). Shows the source and target dependencies as well as the transformations performed in each mapping.
• Target table report (Trg_tbl.rpt). Provides target field transformation expressions, descriptions, and comments for each target table.
• Executed session report (sessions.rpt). Provides information about executed sessions (such as the number of successful rows) in a particular folder.

Note: If your mappings contain shortcuts, these will not be displayed in the generic Crystal Reports. You will have to use the MX2 Views to access the repository, or create a custom SQL view.

In PowerCenter 5.1, you can develop a metadata access strategy using the Metadata Reporter. The Metadata Reporter allows for customized reporting of all repository information without direct access to the repository itself. For more information on the Metadata Reporter, consult Metadata Reporting and Sharing, or the Metadata Reporter Guide included with the PowerCenter documentation.

A printout of the mapping object flow is also useful for clarifying how objects are connected. To produce such a printout, arrange the mapping in Designer so the full mapping appears on the screen, then use Alt+PrtSc to copy the active window to the clipboard. Use Ctrl+V to paste the copy into a Word document.

Error Handling Strategies

Challenge

Efficiently load data into the Enterprise Data Warehouse (EDW) and Data Mart (DM). This Best Practice describes various loading scenarios, the use of data profiles, an alternate method for identifying data errors, methods for handling data errors, and alternatives for addressing the most common types of problems.

Description

When loading data into an EDW or DM, the loading process must validate that the data conforms to known rules of the business. When the source system data does not meet these rules, the process needs to handle the exceptions in an appropriate manner. The business needs to be aware of the consequences of either permitting invalid data to enter the EDW or rejecting it until it is fixed. Both approaches present complex issues. The business must decide what is acceptable and prioritize two conflicting goals:

• The need for accurate information
• The ability to analyze the most complete information with the understanding that errors can exist.

Data Integration Process Validation

In general, there are three methods for handling data errors detected in the loading process:

• Reject All. This is the simplest to implement since all errors are rejected from entering the EDW when they are detected. This provides a very reliable EDW that the users can count on as being correct, although it may not be complete. Both dimensional and factual data are rejected when any errors are encountered. Reports indicate what the errors are and how they affect the completeness of the data.

Dimensional errors cause valid factual data to be rejected because a foreign key relationship cannot be created. These errors need to be fixed in the source systems and reloaded on a subsequent load of the EDW. Once the corrected rows have been loaded, the factual data will be reprocessed and loaded, assuming that all errors have been fixed. This delay may cause some user dissatisfaction since the users need to take into account that the data they are looking at may not be a complete picture of the operational systems until the errors are fixed.

The development effort required to fix a Reject All scenario is minimal, since the rejected data can be processed through existing mappings once it has been fixed. Minimal additional code may need to be written since the data will only enter the EDW if it is correct, and it would then be loaded into the data mart using the normal process.

• Reject None. This approach gives users a complete picture of the data without having to consider data that was not available due to it being rejected during the load process. The problem is that the data may not be accurate. Both the EDW and DM may contain incorrect information that can lead to incorrect decisions.

With Reject None, data integrity is intact, but the data may not support correct aggregations. Factual data can be allocated to dummy or incorrect dimension rows, resulting in grand total numbers that are correct, but incorrect detail numbers. After the data is fixed, reports may change, with detail information being redistributed along different hierarchies.

The development effort to fix this scenario is significant. After the errors are corrected, a new loading process needs to correct both the EDW and DM, which can be a time-consuming effort based on the delay between an error being detected and fixed. The development strategy may include removing information from the EDW, restoring backup tapes for each night's load, and reprocessing the data. Once the EDW is fixed, these changes need to be loaded into the DM.

• Reject Critical. This method provides a balance between missing information and incorrect information. This approach involves examining each row of data, and determining the particular data elements to be rejected. All changes that are valid are processed into the EDW to allow for the most complete picture. Rejected elements are reported as errors so that they can be fixed in the source systems and loaded on a subsequent run of the ETL process.

This approach requires categorizing the data in two ways: 1) as Key Elements or Attributes, and 2) as Inserts or Updates. Key elements are required fields that maintain the data integrity of the EDW and allow for hierarchies to be summarized at different levels in the organization. Attributes provide additional descriptive information per key element. Inserts are important for dimensions because subsequent factual data may rely on the existence of the dimension data row in order to load properly. Updates do not affect the data integrity as much because the factual data can usually be loaded with the existing dimensional data unless the update is to a Key Element.
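The Key Element versus Attribute distinction under Reject Critical can be illustrated with a small Python sketch. Everything named here is hypothetical: the field names, the validation rules, and the function `reject_critical` are assumptions chosen for illustration, not part of PowerCenter. The sketch rejects the whole row when a key element fails validation, and otherwise loads the valid fields while reporting the failed attributes.

```python
# Hypothetical classification: fields listed here are Key Elements,
# every other validated field is treated as an Attribute.
KEY_ELEMENTS = {"location_id"}

# Hypothetical validation rules, one per field.
VALIDATORS = {
    "location_id": lambda v: isinstance(v, int) and v > 0,
    "color": lambda v: v in {"Red", "Black"},
    "hours": lambda v: isinstance(v, str) and v.startswith("Open"),
}


def reject_critical(row):
    """Return (loadable_fields, failed_field_names) for one source row.

    loadable_fields is None when a Key Element is invalid, meaning the
    whole row is rejected. Otherwise invalid attributes are held back
    and reported while the remaining fields are loaded.
    """
    errors = [name for name, is_ok in VALIDATORS.items()
              if not is_ok(row.get(name))]
    if any(name in KEY_ELEMENTS for name in errors):
        return None, errors
    loadable = {name: value for name, value in row.items()
                if name not in errors}
    return loadable, errors


partial, partial_errors = reject_critical(
    {"location_id": 5001, "color": "BRed", "hours": "Open 9-5"})
rejected, rejected_errors = reject_critical(
    {"location_id": None, "color": "Red", "hours": "Open 9-5"})
```

A production mapping would also need the Insert versus Update distinction, which this sketch leaves out.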

The development effort for the Reject Critical method is more extensive than Reject All since it involves classifying fields as critical or non-critical, and developing logic to update the EDW and flag the fields that are in error. The effort also incorporates some tasks from the Reject None approach in that processes must be developed to fix incorrect data in the EDW and DM.

Informatica generally recommends using the Reject Critical strategy to maintain the accuracy of the EDW. By providing the most fine-grained analysis of errors, this method allows the greatest amount of valid data to enter the EDW on each run of the ETL process, while at the same time screening out the unverifiable data fields. However, business management needs to understand that some information may be held out of the EDW, and also that some of the information in the EDW may be at least temporarily allocated to the wrong hierarchies.

Using Profiles

Profiles are tables used to track history of dimensional data in the EDW. As the source systems change, Profile records are created with date stamps that indicate when the change took place. This allows power users to analyze the EDW using either current (As-Is) or past (As-Was) views of dimensional data.

Profiles should occur once per change in the source systems. Problems occur when two fields change in the source system and one of those fields produces an error. The first value passes validation, which produces a new Profile record, while the second value is rejected and is not included in the new Profile. When this error is fixed, it would be desirable to update the existing Profile rather than creating a new one, but the logic needed to perform this UPDATE instead of an INSERT is complicated.

If a third field is changed before the second field is fixed, the correction process cannot be automated. When the second field is fixed, it is difficult for the ETL process to produce a reflection of data changes since there is now a question whether to update a previous Profile or create a new one.

The following hypothetical example represents three field values in a source system. The first row on 1/1/2000 shows the original values. On 1/5/2000, Field 1 changes from Closed to Open, and Field 2 changes from Black to BRed, which is invalid. On 1/10/2000 Field 3 changes from Open 9-5 to Open 24hrs, but Field 2 is still invalid. On 1/15/2000, Field 2 is finally fixed to Red.

Date | Field 1 Value | Field 2 Value | Field 3 Value
1/1/2000 | Closed Sunday | Black | Open 9-5
1/5/2000 | Open Sunday | BRed | Open 9-5
1/10/2000 | Open Sunday | BRed | Open 24hrs
1/15/2000 | Open Sunday | Red | Open 24hrs

Three methods exist for handling the creation and update of Profiles:

1. The first method produces a new Profile record each time a change is detected in the source. If a field value was invalid, then the original field value is maintained.

Date | Profile Date | Field 1 Value | Field 2 Value | Field 3 Value
1/1/2000 | 1/1/2000 | Closed Sunday | Black | Open 9-5
1/5/2000 | 1/5/2000 | Open Sunday | Black | Open 9-5
1/10/2000 | 1/10/2000 | Open Sunday | Black | Open 24hrs
1/15/2000 | 1/15/2000 | Open Sunday | Red | Open 24hrs

By applying all corrections as new Profiles in this method, we simplify the process by applying all changes to the source system directly to the EDW. Each change, regardless of whether it is a fix to a previous error, is applied as a new change that creates a new Profile. This incorrectly shows in the EDW that two changes occurred to the source information when, in reality, a mistake was entered on the first change and should be reflected in the first Profile. The second Profile should not have been created.

2. The second method updates the first Profile created on 1/5/2000 until all fields are corrected on 1/15/2000, which loses the Profile record for the change to Field 3.

Date | Profile Date | Field 1 Value | Field 2 Value | Field 3 Value
1/1/2000 | 1/1/2000 | Closed Sunday | Black | Open 9-5
1/5/2000 | 1/5/2000 | Open Sunday | Black | Open 9-5
1/10/2000 | 1/5/2000 (Update) | Open Sunday | Black | Open 24hrs
1/15/2000 | 1/5/2000 (Update) | Open Sunday | Red | Open 24hrs

If we try to apply changes to the existing Profile, as in this method, we run the risk of losing Profile information. If the third field changes before the second field is fixed, we show the third field changed at the same time as the first. When the second field was fixed it would also be added to the existing Profile, which incorrectly reflects the changes in the source system.

3. The third method creates only two new Profiles, but then causes an update to the Profile records on 1/15/2000 to fix the Field 2 value in both.

Date | Profile Date | Field 1 Value | Field 2 Value | Field 3 Value
1/1/2000 | 1/1/2000 | Closed Sunday | Black | Open 9-5
1/5/2000 | 1/5/2000 | Open Sunday | Black | Open 9-5
1/10/2000 | 1/10/2000 | Open Sunday | Black | Open 24hrs
1/15/2000 | 1/5/2000 (Update) | Open Sunday | Red | Open 9-5
1/15/2000 | 1/10/2000 (Update) | Open Sunday | Red | Open 24hrs

If we try to implement a method that updates old Profiles when errors are fixed, as in this option, we need to create complex algorithms that handle the process correctly. It involves being able to determine when an error occurred and examining all Profiles generated since then and updating them appropriately. And, even if we create the algorithms to handle these methods, we still have an issue of determining if a value is a correction or a new value. If an error is never fixed in the source system, but a new value is entered, we would identify it as a previous error, causing an automated process to update old Profile records, when in reality a new Profile record should have been entered.
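As a rough illustration of the first method (a new Profile for every detected change, with an invalid value replaced by the last accepted one), the hypothetical example can be replayed in Python. The names `build_profiles`, `is_valid`, and `VALID_FIELD2` are assumptions made for this sketch, and the validation rule (Field 2 must be Black or Red) is inferred from the example rather than stated by the text.

```python
# Assumed validation rule: only these Field 2 values are acceptable.
VALID_FIELD2 = {"Black", "Red"}

# Source system snapshots from the hypothetical example.
snapshots = [
    ("1/1/2000", {"field1": "Closed Sunday", "field2": "Black", "field3": "Open 9-5"}),
    ("1/5/2000", {"field1": "Open Sunday", "field2": "BRed", "field3": "Open 9-5"}),
    ("1/10/2000", {"field1": "Open Sunday", "field2": "BRed", "field3": "Open 24hrs"}),
    ("1/15/2000", {"field1": "Open Sunday", "field2": "Red", "field3": "Open 24hrs"}),
]


def is_valid(field, value):
    """Only Field 2 is validated in this sketch."""
    return value in VALID_FIELD2 if field == "field2" else True


def build_profiles(snapshots):
    """Method 1: insert a new Profile whenever the accepted values change."""
    profiles = []
    current = None  # accepted values of the most recent Profile
    for date, values in snapshots:
        candidate = {}
        for field, value in values.items():
            if is_valid(field, value):
                candidate[field] = value
            else:
                # Invalid value: keep the previously accepted value.
                candidate[field] = current[field] if current else None
        if candidate != current:
            profiles.append({"profile_date": date, **candidate})
            current = candidate
    return profiles


profiles = build_profiles(snapshots)
```

The result has four Profiles, matching the table for the first method; it also reproduces that method's weakness, since the 1/15/2000 correction appears as a separate change.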

Recommended Method

A method exists to track old errors so that we know when a value was rejected. Then, when the process encounters a new, correct value it flags it as part of the load strategy as a potential fix that should be applied to old Profile records. In this way, the corrected data enters the EDW as a new Profile record, but the process of fixing old Profile records, and potentially deleting the newly inserted record, is delayed until the data is examined and an action is decided. Once an action is decided, another process examines the existing Profile records and corrects them as necessary. This method only delays the As-Was analysis of the data until the correction method is determined because the current information is reflected in the new Profile.

Data Quality Edits

Quality indicators can be used to record definitive statements regarding the quality of the data received and stored in the EDW. The indicators can be appended to existing data tables or stored in a separate table linked by the primary key. Quality indicators can be used to:

• show the record and field level quality associated with a given record at the time of extract
• identify data sources and errors encountered in specific records
• support the resolution of specific record error types via an update and resubmission process.

Quality indicators may be used to record several types of errors, e.g., fatal errors (missing primary key value), missing data in a required field, wrong data type/format, or invalid data value. If a record contains even one error, data quality (DQ) fields will be appended to the end of the record, one field for every field in the record. A data quality indicator code is included in the DQ fields corresponding to the original fields in the record where the errors were encountered. Records containing a fatal error are stored in a Rejected Record Table and associated to the original file name and record number. These records cannot be loaded to the EDW because they lack a primary key field to be used as a unique record identifier in the EDW.

The following types of errors cannot be processed:

• A source record does not contain a valid key. This record would be sent to a reject queue. Metadata will be saved and used to generate a notice to the sending system indicating that x number of invalid records were received and could not be processed. However, in the absence of a primary key, no tracking is possible to determine whether the invalid record has been replaced or not.
• The source file or record is illegible. The file or record would be sent to a reject queue. Metadata indicating that x number of invalid records were received and could not be processed may or may not be available for a general notice to be sent to the sending system. In this case, due to the nature of the error, no tracking is possible to determine whether the invalid record has been replaced or not. If the file or record is illegible, it is likely that individual unique records within the file are not identifiable. While information can be provided to the source system site indicating there are file errors for x number of records, specific problems may not be identifiable on a record-by-record basis.

In these error types, the records can be processed, but they contain errors:

• A required (non-key) field is missing.
• The value in a numeric or date field is non-numeric.
• The value in a field does not fall within the range of acceptable values identified for the field. Typically, a reference table is used for this validation.

When an error is detected during ingest and cleansing, the identified error type is recorded.

Quality Indicators (Quality Code Table)

The requirement to validate virtually every data element received from the source data systems mandates the development, implementation, capture and maintenance of quality indicators. These are used to indicate the quality of incoming data at an elemental level. Aggregated and analyzed over time, these indicators provide the information necessary to identify acute data quality problems, systemic issues, business process problems and information technology breakdowns.

The quality indicators "0" (No Error), "1" (Fatal Error), "2" (Missing Data from a Required Field), "3" (Wrong Data Type/Format), "4" (Invalid Data Value) and "5" (Outdated Reference Table in Use) apply a concise indication of the quality of the data within specific fields for every data type. These indicators provide the opportunity for operations staff, data quality analysts and users to readily identify issues potentially impacting the quality of the data. At the same time, these indicators provide the level of detail necessary for acute quality problems to be remedied in a timely manner.

Handling Data Errors

The need to periodically correct data in the EDW is inevitable. But how often should these corrections be performed?

The correction process can be as simple as updating field information to reflect actual values, or as complex as deleting data from the EDW, restoring previous loads from tape, and then reloading the information correctly. Although we try to avoid performing a complete database restore and reload from a previous point in time, we cannot rule this out as a possible solution.

Reject Tables vs. Source System

As errors are encountered, they are written to a reject file so that business analysts can examine reports of the data and the related error messages indicating the causes of error. The business needs to decide whether analysts should be allowed to fix data in the reject tables, or whether data fixes will be restricted to source systems. If errors are fixed in the reject tables, the EDW will not be synchronized with the source systems. This can present credibility problems when trying to track the history of changes in the EDW and DM. If all fixes occur in the source systems, then these fixes must be applied correctly to the EDW.
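The quality indicator codes listed under Quality Indicators (Quality Code Table), together with the rule that DQ fields are appended when a record contains even one error, can be sketched as follows. This is a minimal Python illustration, not PowerCenter functionality: the field rules, the `DQ_` prefix, and the function names are assumptions, and code "5" (Outdated Reference Table in Use) is defined but not assigned because the text does not describe how it is detected.

```python
# Quality indicator codes as listed in the text.
NO_ERROR = "0"
FATAL_ERROR = "1"
MISSING_REQUIRED = "2"
WRONG_TYPE = "3"
INVALID_VALUE = "4"
OUTDATED_REFERENCE = "5"  # not assigned in this sketch

# Hypothetical field rules for one record layout.
FIELD_RULES = {
    "store_id": {"key": True, "type": int},
    "open_date": {"required": True, "type": str},
    "store_type": {"type": str, "allowed": {"OFFICE", "STORE", "WAREHSE"}},
}


def dq_codes(record):
    """Assign one quality indicator code to every field in the record."""
    codes = {}
    for field, rule in FIELD_RULES.items():
        value = record.get(field)
        if value is None or value == "":
            if rule.get("key"):
                codes[field] = FATAL_ERROR       # missing primary key value
            elif rule.get("required"):
                codes[field] = MISSING_REQUIRED
            else:
                codes[field] = NO_ERROR
        elif not isinstance(value, rule["type"]):
            codes[field] = WRONG_TYPE
        elif "allowed" in rule and value not in rule["allowed"]:
            codes[field] = INVALID_VALUE
        else:
            codes[field] = NO_ERROR
    return codes


def append_dq_fields(record):
    """Append one DQ field per field, but only if the record has an error."""
    codes = dq_codes(record)
    if all(code == NO_ERROR for code in codes.values()):
        return dict(record)
    dq_fields = {"DQ_" + field: code for field, code in codes.items()}
    return {**record, **dq_fields}


flagged = append_dq_fields({"store_id": 10, "open_date": "", "store_type": "KIOSK"})
clean = append_dq_fields({"store_id": 11, "open_date": "1/1/2000", "store_type": "STORE"})
```

Records whose key field receives code "1" would be routed to the Rejected Record Table rather than loaded.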

Attribute Errors and Default Values

Attributes provide additional descriptive information about a dimension concept. Attributes include things like the color of a product or the address of a store. Attribute errors are typically things like an invalid color or inappropriate characters in the address. These types of errors do not generally affect the aggregated facts and statistics in the EDW; the attributes are most useful as qualifiers and filtering criteria for drilling into the data (e.g., to find specific patterns for market research). Attribute errors can be fixed by waiting for the source system to be corrected and reapplied to the data in the EDW.

When attribute errors are encountered for a new dimensional value, default values can be assigned to let the new record enter the EDW. Some rules that have been proposed for handling defaults are as follows:

Value Types | Description | Default
Reference Values | Attributes that are foreign keys to other tables | Unknown
Small Value Sets | Y/N indicator fields | No
Other | Any other type of attribute | Null or business-provided value

Reference tables are used to normalize the EDW model to prevent the duplication of data. When a source value does not translate into a reference table value, we use the 'Unknown' value. (All reference tables contain a value of 'Unknown' for this purpose.)

The business should provide default values for each identified attribute. Fields that are restricted to a limited domain of values (e.g., On/Off or Yes/No indicators) are referred to as small value sets. When errors are encountered in translating these values, we use the value that represents off or 'No' as the default. Other values, like numbers, are handled on a case-by-case basis. In many cases, the data integration process is set to populate 'Null' into these fields, which means "undefined" in the EDW. After a source system value is corrected and passes validation, it is corrected in the EDW.

Primary Key Errors

The business also needs to decide how to handle new dimensional values such as locations. Problems occur when the new key is actually an update to an old key in the source system. For example, a location number is assigned and the new location is transferred to the EDW using the normal process; then the location number is changed due to some source business rule such as: all Warehouses should be in the 5000 range. The process assumes that the change in the primary key is actually a new warehouse and that the old warehouse was deleted. This type of error causes a separation of fact data, with some data being attributed to the old primary key and some to the new. An analyst would be unable to get a complete picture.

Fixing this type of error involves integrating the two records in the EDW, along with the related facts. Integrating the two rows involves combining the Profile information, taking care to coordinate the effective dates of the Profiles to sequence properly. If two Profile records exist for the same day, then a manual decision is required as to which is correct. If facts were loaded using both primary keys, then the related fact rows must be added together and the originals deleted in order to correct the data.

The situation is more complicated when the opposite condition occurs (i.e., two primary keys mapped to the same EDW ID really represent two different IDs). In this case, it is necessary to restore the source information for both dimensions and facts from the point in time at which the error was introduced, deleting affected records from the EDW and reloading from the restore to correct the errors.

DM Facts Calculated from EDW Dimensions

If information is captured as dimensional data from the source, but used as measures residing on the fact records in the DM, we must decide how to handle the facts. From a data accuracy view, we would like to reject the fact until the value is corrected. If we load the facts with the incorrect data, the process to fix the DM can be time consuming and difficult to implement.

If we let the facts enter the EDW and subsequently the DM, we need to create processes that update the DM after the dimensional data is fixed. This involves updating the measures in the DM to reflect the changed data. If we reject the facts when these types of errors are encountered, the fix process becomes simpler. After the errors are fixed, the affected rows can simply be loaded and applied to the DM.

Fact Errors

If there are no business rules that reject fact records except for relationship errors to dimensional data, then when we encounter errors that would cause a fact to be rejected, we save these rows to a reject table for reprocessing the following night. This nightly reprocessing continues until the data successfully enters the EDW. Initial and periodic analyses should be performed on the errors to determine why they are not being loaded. After they are loaded, they are populated into the DM as usual.

Data Stewards

Data Stewards are generally responsible for maintaining reference tables and translation tables, creating new entities in dimensional data, and designating one primary data source when multiple sources exist. Reference data and translation tables enable the EDW to maintain consistent descriptions across multiple source systems, regardless of how the source system stores the data. New entities in dimensional data include new locations, products, hierarchies, etc. Multiple source data occurs when two source systems can contain different data for the same dimensional entity.

Reference Tables

The EDW uses reference tables to maintain consistent descriptions. Each table contains a short code value as a primary key and a long description for reporting purposes. A translation table is associated with each reference table to map the codes to the source system values. Using both of these tables, the ETL process can load data from the source systems into the EDW and then load from the EDW into the DM.

The translation tables contain one or more rows for each source value and map the value to a matching row in the reference table. For example, the SOURCE column in FILE X on System X can contain 'O', 'S' or 'W'. The data steward would be responsible for entering in the Translation table the following values:

Source Value | Code Translation
O | OFFICE
S | STORE
W | WAREHSE

These values are used by the data integration process to correctly load the EDW. Other source systems that maintain a similar field may use a two-letter abbreviation like 'OF', 'ST' and 'WH'. The data steward would make the following entries into the translation table to maintain consistency across systems:

Source Value | Code Translation
OF | OFFICE
ST | STORE
WH | WAREHSE

The data stewards are also responsible for maintaining the Reference table that translates the Codes into descriptions. The ETL process uses the Reference table to populate the following values into the DM:

Code Translation | Code Description
OFFICE | Office
STORE | Retail Store
WAREHSE | Distribution Warehouse

Error handling is required when the data steward enters incorrect information for these mappings and needs to correct them after data has been loaded. Correcting the above example could be complex (e.g., if the data steward entered ST as translating to OFFICE by mistake). The only way to determine which rows should be changed is to restore and reload source data from the first time the mistake was entered. Processes should be built to handle these types of situations, including correction of the EDW and DM.

Dimensional Data

New entities in dimensional data present a more complex issue. New entities in the EDW may include Locations and Products, at a minimum. Dimensional data uses the same concept of translation as Reference tables. These translation tables map the source system value to the EDW value. For location, this is straightforward, but over time, products may have multiple source system values that map to the same product in the EDW. (Other similar translation issues may also exist, but Products serves as a good example for error handling.)
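The translation and reference tables described above, combined with the 'Unknown' default for source values that do not translate, can be sketched in Python. The code values and descriptions come from the example tables; the system name `SYSTEM_Y`, the `UNKNOWN` code, and the function `translate` are hypothetical names introduced only for this illustration.

```python
# Translation table: (source system, source value) -> EDW code.
# SYSTEM_X values come from the example; SYSTEM_Y is a hypothetical
# name for the second system that uses two-letter abbreviations.
TRANSLATION = {
    ("SYSTEM_X", "O"): "OFFICE",
    ("SYSTEM_X", "S"): "STORE",
    ("SYSTEM_X", "W"): "WAREHSE",
    ("SYSTEM_Y", "OF"): "OFFICE",
    ("SYSTEM_Y", "ST"): "STORE",
    ("SYSTEM_Y", "WH"): "WAREHSE",
}

# Reference table: EDW code -> description used for reporting.
# The UNKNOWN row is an assumed representation of the 'Unknown' value
# that every reference table is said to contain.
REFERENCE = {
    "OFFICE": "Office",
    "STORE": "Retail Store",
    "WAREHSE": "Distribution Warehouse",
    "UNKNOWN": "Unknown",
}


def translate(system, source_value):
    """Map a source value to its EDW code and reporting description.

    Values with no translation entry fall back to the UNKNOWN code so
    the row can still be loaded and corrected later.
    """
    code = TRANSLATION.get((system, source_value), "UNKNOWN")
    return code, REFERENCE[code]


office = translate("SYSTEM_X", "O")
warehouse = translate("SYSTEM_Y", "WH")
unmapped = translate("SYSTEM_Y", "XX")
```

Keeping the source system in the lookup key lets two systems reuse the same abbreviation with different meanings without colliding.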

There are two possible methods for loading new dimensional entities. Either require the data steward to enter the translation data before allowing the dimensional data into the EDW, or create the translation data through the ETL process and force the data steward to review it. The first option requires the data steward to create the translation for new entities, while the second lets the ETL process create the translation, but marks the record as 'Pending Verification' until the data steward reviews it and changes the status to 'Verified' before any facts that reference it can be loaded.

When the dimensional value is left as 'Pending Verification' however, facts may be rejected or allocated to dummy values. This requires the data stewards to review the status of new values on a daily basis. A potential solution to this issue is to generate an e-mail each night if there are any translation table entries pending verification. The data steward then opens a report that lists them.

A problem specific to Product is that when it is created as new, it is really just a changed SKU number. This causes additional fact rows to be created, which produces an inaccurate view of the product when reporting. When this is fixed, the fact rows for the various SKU numbers need to be merged and the original rows deleted. Profiles would also have to be merged, requiring manual intervention.

The situation is more complicated when the opposite condition occurs (i.e., two products are mapped to the same product, but really represent two different products). In this case, it is necessary to restore the source information for all loads since the error was introduced. Affected records from the EDW should be deleted and then reloaded from the restore to correctly split the data. Facts should be split to allocate the information correctly and dimensions split to generate correct Profile information.

Manual Updates

Over time, any system is likely to encounter errors that are not correctable using source systems. A method needs to be established for manually entering fixed data and applying it correctly to the EDW, and subsequently to the DM, including beginning and ending effective dates. These dates are useful for both Profile and Date Event fixes. Further, a log of these fixes should be maintained to enable identifying the source of the fixes as manual rather than part of the normal load process.

Multiple Sources

The data stewards are also involved when multiple sources exist for the same data. This occurs when two sources contain subsets of the required information. For example, one system may contain Warehouse and Store information while another contains Store and Hub information. Because they share Store information, it is difficult to decide which source contains the correct information.

When this happens, both sources have the ability to update the same row in the EDW. If both sources are allowed to update the shared information, data accuracy and Profile problems are likely to occur. If we update the shared information on only one source system, the two systems then contain different information. If the changed system is loaded into the EDW, it creates a new Profile indicating the information changed. When the second system is loaded, it compares its old unchanged value to the new Profile, assumes a change occurred and creates another new Profile with the old, unchanged value. If the two systems remain different, the process causes two Profiles to be loaded every day until the two source systems are synchronized with the same information.

To avoid this type of situation, the business analysts and developers need to designate, at a field level, the source that should be considered primary for the field. Then, only if the field changes on the primary source would it be changed. While this sounds simple, it requires complex logic when creating Profiles, because multiple sources can provide information toward the one Profile record created for that day.

One solution to this problem is to develop a system of record for all sources. This allows developers to pull the information from the system of record, knowing that there are no conflicts for multiple sources. Another solution is to indicate, at the field level, a primary source where information can be shared from multiple sources. Developers can use the field level information to update only the fields that are marked as primary. However, this requires additional effort by the data stewards to mark the correct source fields as primary and by the data integration team to customize the load process.
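The field-level primary source approach can be sketched as a small Python function. The source names, field names, and the function `apply_update` are all hypothetical; the sketch only shows the rule that a field is updated when, and only when, the change arrives from the source designated as primary for that field.

```python
# Hypothetical designation of the primary source for each shared field,
# as it might be maintained by the data stewards.
PRIMARY_SOURCE = {
    "store_name": "SYSTEM_A",
    "store_hours": "SYSTEM_B",
}


def apply_update(edw_row, source, incoming):
    """Return a copy of edw_row updated only from primary-source fields.

    A change to a field is ignored unless `source` is the designated
    primary source for that field.
    """
    updated = dict(edw_row)
    for field, value in incoming.items():
        if PRIMARY_SOURCE.get(field) == source and updated.get(field) != value:
            updated[field] = value
    return updated


edw_row = {"store_name": "Main St", "store_hours": "Open 9-5"}

# SYSTEM_A is primary for store_name only, so its hours change is ignored.
after_a = apply_update(
    edw_row, "SYSTEM_A",
    {"store_name": "Main Street", "store_hours": "Open 24hrs"})

# SYSTEM_B is primary for store_hours, so this change is applied.
after_b = apply_update(after_a, "SYSTEM_B", {"store_hours": "Open 24hrs"})
```

Because a non-primary source can never overwrite a field, loading the two systems in either order gives the same row, which avoids the repeated daily Profiles described above.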

Using Shortcut Keys in PowerCenter Designer

Challenge

Using shortcut keys in PowerCenter Designer to edit repository objects.

Description

General Suggestions

• To open a folder with the workspace open as well, click on an Open folder icon (rather than double-clicking on it). Alternatively, right-click on the folder name, then scroll down and click on "open".
• When using the "drag & drop" approach to create Foreign Key/Primary Key relationships between tables, be sure to start in the Foreign Key table and drag the key/field to the Primary Key table. Set the Key Type value to "NOT A KEY" prior to dragging.
• To use Create Customized Toolbars to tailor a toolbar for the functions you commonly perform, press <Alt><T> then <C>.
• To delete customized icons, go into customize toolbars under the tools menu. From here you can either add new icons to your toolbar by "dragging and dropping" them from the toolbar menu or you can "drag and drop" an icon from the current toolbar if you no longer want to use it.
• To use a Docking\UnDocking window such as Repository Navigator, double-click on the window's title bar.
• To quickly select multiple transformations, hold the mouse down and drag to view a box. Be sure the box touches every object you want to select.
• To expedite mapping development, use multiple fields/ports selection to copy or link.
• To copy a mapping from a shared folder, press and hold <Ctrl> and highlight the mapping with the left mouse button, then drag and drop into another folder or mapping and click OK. The same action, without holding <Ctrl>, creates a Shortcut to an object.
• To start the Debugger, press <F9>.
• If possible, use an icon in the toolbar rather than a command from a drop-down menu.

Edit Tables/Transformation

• To edit any cell in the grid, press <F2>, then move the cursor to the character you want to edit and click OK.
• To cancel an edit in the grid, press <Esc> then click OK.
• For all combo/dropdown list boxes, just type the first letter on the list to select the item you want.
• To copy a selected item in the grid, press <Ctrl><C>.
• To paste a selected item from the Clipboard to the grid, press <Ctrl><V>.
• To delete a selected field or port from the grid, press <Alt><C>.
• To copy a selected row from the grid, press <Alt><O>.
• To paste a selected row from the grid, press <Alt><P>.
• To add a new field or port, first highlight an existing field or port, then press <Alt><f> to insert the new field/port below it and click OK.
• To move the current field in a transformation Down, first highlight it, then press <Alt><w> and click OK.
• To move the current field in a transformation Up, first highlight it, then press <Alt><u> and click OK.
• To validate the Default value, first highlight the port you want to validate, then press <Alt><v> and click OK.
• When adding a new port, just begin typing. You don't need to press DEL first to remove the 'NEWFIELD' text, then click OK when you have finished.
• When moving about the expression fields via arrow keys:
  o Use the SPACE bar to check/uncheck the port type. The box must be highlighted in order to check/uncheck the port type.
  o Press <F2> then <F3> to quickly open the Expression Editor of an OUT/VAR port. The expression must be highlighted.

Expression Editor

• To expedite the validation of a newly created expression, simply press OK to initiate the parsing/validation of the expression, then press OK once again in the "Expression parsed successfully" pop-up.
• To select PowerCenter functions and ports during expression creation, use the Functions and Ports Tab.

Creating Inventories of Reusable Objects & Mappings

Challenge

Successfully creating inventories of reusable objects and mappings, including identifying potential economies of scale in loading multiple sources to the same target.

Description

Reusable Objects

The first step in creating an inventory of reusable objects is to review the business requirements and look for any common routines/modules that may appear in more than one data movement. These common routines are excellent candidates for reusable objects. In PowerCenter, reusable objects can be single transformations (lookups, filters, etc.) or even a string of transformations (mapplets).

Evaluate potential reusable objects by two criteria:

• Is there enough usage and complexity to warrant the development of a common object?
• Are the data types of the information passing through the reusable object the same from case to case or is it simply the same high-level steps with different fields and data?

Common objects are sometimes created just for the sake of creating common components when in reality, creating and testing the object does not save development time or future maintenance. For example, if there is a simple calculation like subtracting a current rate from a budget rate that will be used for two different mappings, carefully consider whether the effort to create, test, and document the common object is worthwhile. Often, it is simpler to add the calculation to both mappings. However, if the calculation were to be performed in a number of mappings, if it was very difficult, and if all occurrences would be updated following any change or fix, then this would be an ideal case for a reusable object.

The second criterion for a reusable object concerns the data that will pass through the reusable object. Many times developers see a situation where they may perform a certain type of high-level process (e.g., filter, expression, update strategy) in two or more mappings; at first look, this seems like a great candidate for a mapplet. However, after performing half of the mapplet work, the developers may realize that the actual data or ports passing through the high level logic are totally different from case to case, thus making the use of a mapplet impractical. Consider whether there is a practical way to generalize the common logic, so that it can be successfully applied to multiple cases. Remember, when creating a reusable object, the actual object will be replicated in one to many mappings. Thus, in each mapping using the mapplet or reusable transformation object, the same size and number of ports must pass into and out of the mapping/reusable object.

Document the list of the reusable objects that pass this criteria test, providing a high-level description of what each object will accomplish. The detailed design will occur in a future subtask, but at this point the intent is to identify the number and functionality of reusable objects that will be built for the project. Keep in mind that it will be impossible to identify 100 percent of the reusable objects at this point; the goal here is to create an inventory of as many as possible, and hopefully the most difficult ones. The remainder will be discovered while building the data integration processes.

Mappings

A mapping is an individual movement of data from a source system to a target system. In a simple world, a single source table would populate a single target table. However, in practice, this is usually not the case. Sometimes multiple sources of data need to be combined to create a target table, and sometimes a single source of data creates many target tables. The latter is especially true for mainframe data sources where COBOL OCCURS statements litter the landscape. In a typical warehouse or data mart model, each OCCURS statement decomposes to a separate table.

The goal here is to create an inventory of the mappings needed for the project. For this exercise, the challenge is to think in individual components of data movement. While the business may consider a fact table and its three related dimensions as a single 'object' in the data mart or warehouse, five mappings may be needed to populate the corresponding star schema with data (i.e., one for each of the dimension tables and two for the fact table, each from a different source system).

Typically, when creating an inventory of mappings, the focus is on the target tables, with an assumption that each target table has its own mapping, or sometimes multiple mappings. While often true, if a single source of data populates multiple tables, this approach yields multiple mappings. Efficiencies can sometimes be realized by loading multiple tables from a single source. By simply focusing on the target tables, however, these efficiencies can be overlooked.

A more comprehensive approach to creating the inventory of mappings is to create a spreadsheet listing all of the target tables. Create a column with a number next to each Target table. For each of the target tables, in another column, list the source file or table that will be used to populate the table. In the case of multiple source tables per target, create two rows for the target, each with the same number, and list the additional source(s) of data. The table would look similar to the following:

Number  Target Table   Source
1       Customers      Cust_File
2       Products       Items
3       Customer_Type  Cust_File
4       Orders_Item    Tickets
4       Orders_Item    Ticket_Items

When completed, the spreadsheet can be sorted either by target table or source table. Sorting by source table can help determine potential mappings that create multiple targets. When using a source to populate multiple tables at once for efficiency, be sure to keep restartability/reloadability in mind. The mapping will always load two or more target tables from the source, so there will be no easy way to rerun a single table. In this example, potentially the Customers table and the Customer_Type tables can be loaded in the same mapping.

When merging targets into one mapping in this manner, give both targets the same number. Then, re-sort the spreadsheet by number. For the mappings with multiple sources or targets, merge the data back into a single row to generate the inventory of mappings, with each number representing a separate mapping. The inventory would look similar to the following:

Number  Target Table              Source
1       Customers, Customer_Type  Cust_File
2       Products                  Items
4       Orders_Item               Tickets, Ticket_Items

At this point, it is often helpful to record some additional information about each mapping to help with planning and maintenance. First, give each mapping a name. Apply the naming standards generated in 2.2 DESIGN DEVELOPMENT ARCHITECTURE. These names can then be used to distinguish mappings from each other and also can be put on the project plan as individual tasks.

Next, determine for the project a threshold for a High, Medium, or Low number of target rows. For example, in a warehouse where dimension tables are likely to number in the thousands and fact tables in the hundred thousands, the following thresholds might apply:

Low: 1 to 10,000 rows
Med: 10,000 to 100,000 rows
High: 100,000 rows +

Assign a likely row volume (High, Med or Low) to each of the mappings based on the expected volume of data to pass through the mapping. These high level estimates will help to determine how many mappings are of 'High' volume; these mappings will be the first candidates for performance tuning.

Add any other columns of information that might be useful to capture about each mapping, such as a high-level description of the mapping functionality, resource (developer) assigned, initial estimate, actual completion time, or complexity rating.
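As a rough illustration of the spreadsheet exercise, the following Python sketch merges numbered (number, target, source) rows into one inventory entry per mapping and assigns a volume rating. The function names are hypothetical. The example thresholds in the text share their boundary values, so the cut-offs used here (up to 10,000 is Low, 100,000 and above is High) are an assumption.

```python
# Illustrative thresholds based on the example in the text. How the shared
# boundary values (10,000 and 100,000) are classified is an assumption.
LOW_MAX = 10_000
HIGH_MIN = 100_000


def volume_rating(expected_rows):
    """Classify an expected target row count as Low, Med, or High."""
    if expected_rows >= HIGH_MIN:
        return "High"
    if expected_rows > LOW_MAX:
        return "Med"
    return "Low"


def build_inventory(rows):
    """Merge (number, target, source) rows into one entry per mapping number."""
    inventory = {}
    for number, target, source in rows:
        entry = inventory.setdefault(number, {"targets": [], "sources": []})
        if target not in entry["targets"]:
            entry["targets"].append(target)
        if source not in entry["sources"]:
            entry["sources"].append(source)
    # Return the entries ordered by mapping number, like the re-sorted sheet.
    return dict(sorted(inventory.items()))


if __name__ == "__main__":
    # Customer_Type has already been renumbered to 1 so that it shares a
    # mapping with Customers, as in the example above.
    rows = [
        (1, "Customers", "Cust_File"),
        (2, "Products", "Items"),
        (1, "Customer_Type", "Cust_File"),
        (4, "Orders_Item", "Tickets"),
        (4, "Orders_Item", "Ticket_Items"),
    ]
    for number, entry in build_inventory(rows).items():
        print(number, entry["targets"], entry["sources"])
```

Each key of the result corresponds to one mapping, which matches the merged inventory table shown earlier.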

Updating Repository Statistics

Challenge

The PowerCenter repository has more than eighty tables, and nearly all use one or more indexes to speed up queries. Most databases keep and use column distribution statistics to determine which index to use in order to optimally execute SQL queries. Database servers do not update these statistics continuously, so they quickly become outdated in frequently-used repositories, and SQL query optimizers may choose a less-than-optimal query plan. In large repositories, choosing a sub-optimal query plan can drastically affect performance. As a result, the repository becomes slower and slower over time.

Description

The Database Administrator needs to continually update the database statistics to ensure that they remain up-to-date. The frequency of updating depends on how heavily the repository is used. Because the statistics need to be updated table by table, it is useful for Database Administrators to create scripts to automate the task. For the repository tables, it is helpful to understand that all PowerCenter repository tables and index names begin with "OPB_" or "REP_". The following information is useful for generating scripts to update distribution statistics.

Oracle

Run the following queries:

    select 'analyze table ', table_name, ' compute statistics;' from user_tables where table_name like 'OPB_%'

    select 'analyze index ', INDEX_NAME, ' compute statistics;' from user_indexes where INDEX_NAME like 'OPB_%'

This will produce output like:

    'ANALYZETABLE' TABLE_NAME 'COMPUTESTATISTICS;'
    analyze table OPB_ANALYZE_DEP compute statistics;
    analyze table OPB_ATTR compute statistics;
    analyze table OPB_BATCH_OBJECT compute statistics;
    . . .

    'ANALYZEINDEX' INDEX_NAME 'COMPUTESTATISTICS;'
    analyze index OPB_DBD_IDX compute statistics;
    analyze index OPB_DIM_LEVEL compute statistics;
    analyze index OPB_EXPR_IDX compute statistics;
    . . .

Save the output to a file. Then, edit the file and remove all the headers (i.e., the lines that look like: 'ANALYZEINDEX' INDEX_NAME 'COMPUTESTATISTICS;'). Run this as a SQL script. This updates statistics for the repository tables.

MS SQL Server

Run the following query:

    select 'update statistics ', name from sysobjects where name like 'OPB_%'

This will produce output like:

    name
    update statistics OPB_ANALYZE_DEP
    update statistics OPB_ATTR
    update statistics OPB_BATCH_OBJECT
    . . .

Save the output to a file, then edit the file and remove the header information (i.e., the top two lines) and add a 'go' at the end of the file. Run this as a SQL script. This updates statistics for the repository tables.

Sybase

Run the following query:

    select 'update statistics ', name from sysobjects where name like 'OPB_%'

This will produce output like:

    name
    update statistics OPB_ANALYZE_DEP
    update statistics OPB_ATTR
    update statistics OPB_BATCH_OBJECT
    . . .

Save the output to a file, then remove the header information (i.e., the top two lines), and add a 'go' at the end of the file. Run this as a SQL script. This updates statistics for the repository tables.

Informix

Run the following query:

    select 'update statistics low for table ', tabname, ' ;' from systables where tabname like 'opb_%' or tabname like 'OPB_%';

This will produce output like:

    (constant) tabname (constant)
    update statistics low for table OPB_ANALYZE_DEP ;
    update statistics low for table OPB_ATTR ;
    update statistics low for table OPB_BATCH_OBJECT ;
    . . .

Save the output to a file, then edit the file and remove the header information (i.e., the top line that looks like: (constant) tabname (constant)). Run this as a SQL script. This updates statistics for the repository tables.

DB2

Run the following query:

    select 'runstats on table ', (rtrim(tabschema)||'.')||tabname, ' and indexes all;' from sysstat.tables where tabname like 'OPB_%'

This will produce output like:

    runstats on table PARTH.OPB_ANALYZE_DEP and indexes all;
    runstats on table PARTH.OPB_ATTR and indexes all;
    runstats on table PARTH.OPB_BATCH_OBJECT and indexes all;
    . . .

Save the output to a file. Run this as a SQL script to update statistics for the repository tables.
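The "generate, strip headers, add go" routine can also be scripted outside the database. The following Python sketch builds the same statements from a list of table names that is assumed to have been retrieved already. The function name, the dictionary keys and the idea of passing table names in as a list are assumptions for illustration; only the statement texts are taken from the sections above.

```python
# Statement templates mirroring the per-database commands shown above.
# "{t}" is the table name (schema-qualified for DB2, e.g. PARTH.OPB_ATTR).
TEMPLATES = {
    "oracle": "analyze table {t} compute statistics;",
    "sqlserver": "update statistics {t}",
    "sybase": "update statistics {t}",
    "informix": "update statistics low for table {t} ;",
    "db2": "runstats on table {t} and indexes all;",
}

# Databases whose script needs a trailing 'go', as described above.
NEEDS_GO = {"sqlserver", "sybase"}


def build_stats_script(db_type, table_names, prefix="OPB_"):
    """Return a header-free statistics script for the repository tables."""
    if db_type not in TEMPLATES:
        raise ValueError(f"unsupported database type: {db_type}")
    template = TEMPLATES[db_type]
    lines = [
        template.format(t=name)
        for name in table_names
        # Compare only the unqualified table name against the prefix.
        if name.split(".")[-1].upper().startswith(prefix)
    ]
    if not lines:
        return ""
    if db_type in NEEDS_GO:
        lines.append("go")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_stats_script("sybase", ["OPB_ANALYZE_DEP", "OPB_ATTR"]), end="")
```

Because the script is built without header lines, the manual editing step is not needed. The generated text would still be run as a SQL script with the database's own client.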

Daily Operations

Challenge

Once the data warehouse has been moved to production, the most important task is keeping the system running and available for the end users.

Description

In most organizations, the day-to-day operation of the data warehouse is the responsibility of a Production Support Team. This team is typically involved with the support of other systems and has expertise in database systems and various operating systems. The Data Warehouse Development team becomes, in effect, a customer to the Production Support team. To that end, the Production Support team needs two documents, a Service Level Agreement and an Operations Manual, to help in the support of the production data warehouse.

Service Level Agreement

The Service Level Agreement outlines how the overall data warehouse system will be maintained. This is a high-level document that discusses the system to be maintained and the components of the system, and identifies the groups responsible for monitoring the various components of the system. At a minimum, it should contain the following information:

• Times when the system should be available to users
• Scheduled maintenance window
• Who is expected to monitor the operating system
• Who is expected to monitor the database
• Who is expected to monitor the Informatica sessions
• How quickly the support team is expected to respond to notifications of system failures
• Escalation procedures that include data warehouse team contacts in the event that the support team cannot resolve the system failure

Operations Manual

The Operations Manual is crucial to the Production Support team because it provides the information needed to perform the maintenance of the data warehouse system. This manual should be self-contained, providing all of the information necessary for a production support operator to maintain the system and resolve most problems that may arise. This manual should contain information on how to maintain all components of the data warehouse system. At a minimum, the Operations Manual should contain:

• Information on how to stop and re-start the various components of the system
• Ids and passwords (or how to obtain passwords) for the system components
• Information on how to re-start failed PowerCenter sessions
• A listing of all jobs that are run, their frequency (daily, weekly, monthly, etc.), and the average run times
• Who to call in the event of a component failure that cannot be resolved by the Production Support team

Load Validation

Challenge

Knowing that all data for the current load cycle has loaded correctly is essential for good data warehouse management. However, the need for load validation varies, depending on the extent of error checking, data validation or data cleansing functionality inherent in your mappings.

Description

Methods for validating the load process range from simple to complex. The first step is to determine what information you need for load validation (e.g., batch names, session names, session start times, session completion times, successful rows and failed rows). Then, you must determine the source of this information. All this information is stored as metadata in the repository, but you must have a means of extracting this information. Finally, you must determine how you want this information presented to you. Do you want it stored as a flat file? Do you want it e-mailed to you? Do you want it available in a relational table, so that history can easily be preserved? All of these factors weigh in finding the correct solution for you.

The following paragraphs describe three possible solutions for load validation, beginning with a fairly simple solution and moving toward the more complex:

1. Post-session e-mails on either success or failure

Post-session e-mail is configured in the session, under the General tab and 'Session Commands'.

A number of variables are available to simplify the text of the e-mail:

    %s  Session name
    %e  Session status
    %b  Session start time
    %c  Session completion time
    %i  Session elapsed time
    %l  Total records loaded

session_timestamp. session end time. session name. You can do this by sourcing the MX view REP_SESS_LOG and then performing lookups to other repository tables or views for additional information.- %r Total records rejected %t Target table details %m Name of the mapping used in the session %n Name of the folder containing the session %d Name of the repository containing the session %g Attach the session log to the message TIP: One practical application of this functionality is the situation in which a key business user waits for completion of a session to run a report. successful_rows.actual_start) * 24 * 60 * 60 from rep_sess_log a where session_timestamp = (select max(session_timestamp) from rep_sess_log where session_name =a. Use a mapping A more complex approach. session_name The sample output would look like this: Folder Name Web Analytics Web Analytics Finance Finance HR Session Name Session End Time 5/8/2001 7:49:18 AM 5/8/2001 5/8/2001 5/8/2001 5/8/2001 7:53:01 8:06:01 8:10:32 8:15:27 AM AM AM AM Successful Rows 12900 125000 35987 45 5 Failed Session Rows Duration (sec’s) 0 0 0 0 0 126 478 178 12 10 S M W DYNMIC KEYS FILE LOAD SMW LOAD WEB FACT SMW NEW LOANS SMW UPD LOANS SMW NEW PERSONNEL 3. You can configure email to this user. is a great place to start . Query the repository Almost any query can be put together to retrieve data about the load execution from the repository. notifying him/her that the session was successful and the report can run. The MX view. (session_timestamp .session_name) order by subject_area. session_name. This view is likely to contain all the information you need. 2. The following sample query shows how to extract folder name. and the most customizable. REP_SESS_LOG. The following graphic illustrates a sample mapping: PAGE BP-120 BEST PRACTICES INFORMATICA CONFIDENTIAL . is to create a PowerCenter mapping to populate a table or flat file with desired information. successful rows and session duration: select subject_area.

This mapping selects data from REP_SESS_LOG and performs lookups to retrieve the absolute minimum and maximum run times for that particular session. This enables you to compare the current execution time with the minimum and maximum durations.
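The comparison that the mapping performs can be sketched outside PowerCenter as well. The Python below is a minimal illustration, assuming the run history has already been extracted (for example with a query against REP_SESS_LOG) into tuples of session name, end time and duration in seconds. The tuple layout and the function name are hypothetical, and the history values other than the 178-second run are invented for the example.

```python
from collections import defaultdict


def duration_report(runs):
    """Summarize run history per session.

    `runs` is an iterable of (session_name, end_time, duration_secs), where
    end_time sorts chronologically (e.g. an ISO 8601 string). For each
    session, the result holds the latest duration plus the historical
    minimum and maximum durations.
    """
    history = defaultdict(list)
    for session_name, end_time, duration_secs in runs:
        history[session_name].append((end_time, duration_secs))

    report = {}
    for session_name, entries in history.items():
        durations = [duration for _, duration in entries]
        _, latest_duration = max(entries)  # entry with the latest end time
        report[session_name] = {
            "latest": latest_duration,
            "min": min(durations),
            "max": max(durations),
        }
    return report


if __name__ == "__main__":
    runs = [
        ("SMW NEW LOANS", "2001-05-06T08:06:01", 150),
        ("SMW NEW LOANS", "2001-05-07T08:06:01", 120),
        ("SMW NEW LOANS", "2001-05-08T08:06:01", 178),
    ]
    print(duration_report(runs))
```

A latest duration at or near the historical maximum is one simple signal that a session deserves a closer look.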

Third Party Scheduler

Challenge

Successfully integrate a third-party scheduler with PowerCenter. This Best Practice describes various levels to integrate a third-party scheduler.

Description

When moving into production, many companies require the use of a third-party scheduler that is the company standard. A third-party scheduler can start and stop an Informatica session or batch using the PMCMD commands. Because PowerCenter has a scheduler, there are several levels at which to integrate a third-party scheduler with PowerCenter. The correct level of integration depends on the complexity of the batch/schedule and level and type of production support.

Third Party Scheduler Integration Levels

In general, there are three levels of integration between a third-party scheduler and Informatica: Low Level, Medium Level, and High Level.

Low Level

Low level integration refers to a third-party scheduler kicking off only one Informatica session or a batch. That initial PowerCenter process subsequently kicks off the rest of the sessions and batches. The PowerCenter scheduler handles all processes and dependencies after the third-party scheduler has kicked off the initial batch or session. In this level of integration, nearly all control lies with the PowerCenter scheduler.

This type of integration is very simple and should only be used as a loophole to fulfill a corporate mandate on a standard scheduler. A low level of integration is very simple to implement because the third-party scheduler kicks off only one process. The third-party scheduler is not adding any functionality that cannot be handled by the PowerCenter scheduler.

Low level integration requires Production Support personnel to have a thorough knowledge of PowerCenter. Because many companies only have Production Support personnel with knowledge in the company's standard scheduler, one of the main disadvantages of this level of integration is that if a batch fails at some point, the Production Support personnel may not be able to determine the exact breakpoint. Thus, the majority of the production support burden falls back on the Project Development team.

Medium Level

Medium level integration is when a third-party scheduler kicks off many different batches or sessions, but not all sessions. A third-party scheduler may kick off several PowerCenter batches and sessions but within those batches, PowerCenter may have several sessions defined with dependencies. Therefore, PowerCenter is controlling the dependencies within those batches. In this level of integration, the control is shared between PowerCenter and a third-party scheduler.

This type of integration is more complex than low level integration because there is much more interaction between the third-party scheduler and PowerCenter. However, to reduce the total amount of work required to integrate the third-party scheduler and PowerCenter, many of the PowerCenter sessions may be left in batches. This reduces the integration chores because the third-party scheduler is only communicating with a limited number of PowerCenter batches.

Medium level integration requires Production Support personnel to have a fairly good knowledge of PowerCenter. Because Production Support personnel in many companies are knowledgeable only about the company's standard scheduler, one significant disadvantage of this level of integration is that if the batch fails at some point, the Production Support personnel may not be able to determine the exact breakpoint. They are probably able to determine the general area, but not necessarily the specific session. Thus, the production support burden is shared between the Project Development team and the Production Support team.

High Level

High level integration is when a third-party scheduler has full control of scheduling and kicks off all PowerCenter sessions. Because the PowerCenter sessions are not part of any batches, the third-party scheduler controls all dependencies among the sessions. This type of integration is the most complex to implement because there are many more interactions between the third-party scheduler and PowerCenter.

High level integration allows the Production Support personnel to have only limited knowledge of PowerCenter. Because the Production Support personnel in many companies are knowledgeable only about the company's standard scheduler, one of the main advantages of this level of integration is that if the batch fails at some point, the Production Support personnel are usually able to determine the exact breakpoint. Thus, the production support burden lies with the Production Support team.
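With high level integration, every dependency that PowerCenter batches would otherwise enforce has to be expressed in the third-party scheduler. The Python sketch below illustrates the kind of dependency ordering such a scheduler must derive. It uses the standard library `graphlib` module (Python 3.9 or later), the session names are hypothetical, and it does not represent any particular scheduler product or the PMCMD interface.

```python
from graphlib import TopologicalSorter


def run_order(dependencies):
    """Return one valid start order for the sessions.

    `dependencies` maps each session name to the set of sessions that must
    complete before it may start.
    """
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    # Hypothetical sessions: both dimensions load before the fact table,
    # and the aggregate build waits for the fact load.
    dependencies = {
        "load_customers": set(),
        "load_products": set(),
        "load_fact": {"load_customers", "load_products"},
        "build_aggregates": {"load_fact"},
    }
    print(run_order(dependencies))
```

Under medium level integration the same map would be much smaller, because the scheduler would see only the batches and not the sessions inside them.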


Event Based Scheduling

Challenge

In an operational environment, the start of a session needs to be triggered by another session or other event. The best method of event-based scheduling with the PowerCenter Server is the use of indicator files.

Description

The indicator file configuration is specified in the session configuration, under advanced options. The file used as the indicator file must be able to be located by the PowerCenter Server, much like a flat file source. When the session starts, the PowerCenter Server will look for the existence of this file and will remove it when it sees it.

If the session is waiting on its source file to be FTP'ed from another server, the FTP process should be scripted so that it creates the indicator file upon successful completion of the source file FTP. This file can be an empty, or dummy, file. The mere existence of the dummy file is enough to indicate that the session should start. The dummy file will be removed immediately after it is located. It is, therefore, essential that you do not use your flat file source as the indicator file.
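The indicator file handshake can be illustrated with a small Python sketch. This is not how the PowerCenter Server is implemented; it only mimics the behavior described above. The function names, the `.part` temporary suffix and the polling interval are assumptions made for the example.

```python
import os
import time


def publish_with_indicator(data, source_path, indicator_path):
    """Write the source file completely, then create an empty indicator file."""
    partial_path = source_path + ".part"  # hypothetical temporary name
    with open(partial_path, "wb") as handle:
        handle.write(data)
    os.replace(partial_path, source_path)  # the source file is now complete
    open(indicator_path, "w").close()      # empty "dummy" indicator file


def wait_for_indicator(indicator_path, timeout_secs=5.0, poll_secs=0.1):
    """Poll for the indicator file; remove it once seen.

    Returns True if the indicator appeared before the timeout, else False.
    """
    deadline = time.monotonic() + timeout_secs
    while True:
        if os.path.exists(indicator_path):
            os.remove(indicator_path)  # removed as soon as it is located
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_secs)
```

Because the waiting side deletes the file it finds, pointing it at the flat file source itself would destroy the data, which is why a separate dummy file is used.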

Repository Administration

Challenge

The task of managing the repository, either in development or production, is extremely important. A number of best practices are available to facilitate the tasks involved with this responsibility.

Description

The following paragraphs describe several of the key tasks involved in managing the repository:

Backing Up the Repository

Two back-up methods are advisable for repository backup: (1) either the PowerCenter Repository Manager or the 'pmrep' command line utility, and (2) the traditional database backup method. The native PowerCenter backup is required, and Informatica recommends using both methods, although using both is not essential. If database corruption occurs, the native PowerCenter backup provides a clean backup that can be restored to a new database.

Analyzing Tables in the Repository

If operations in any of the client tools, including connectivity to the PowerCenter repository, are slowing down, you may need to analyze the tables in the repository to facilitate data retrieval, thereby increasing performance.

Purging Old Session Log Information

Similarly, if folder copies are taking an unusually long time, the OPB_SESSION_LOG and/or OPB_SESS_TARG_LOG tables may be being transferred. Removing unnecessary data from these tables will expedite the repository backup process as well as the folder copy operation. To determine which logs to eliminate, execute the following select statement to retrieve the sessions with the most entries in OPB_SESSION_LOG:

select subj_name, sessname, count(*)
from opb_session_log a, opb_subject b, opb_load_session c
where a.session_id = c.session_id
and b.subj_id = c.subj_id
group by subj_name, sessname
order by count(*) desc

1. Copy the original session, then delete original session. When a session is copied, the entries in the repository tables do not duplicate. When you delete the session, the entries in the tables are deleted, eliminating all rows for an individual session.
2. Log into Repository Manager and expand the sessions in a particular folder. When you select one of the sessions, all of the session logs will appear on the right-hand side of the screen. You can manually delete any of these by highlighting a particular log, then selecting Delete from the Edit menu. Respond ‘Yes’ when the system prompts you with the question “Delete these logs from the Repository?”

pmrep Utility

The pmrep utility was introduced in PowerCenter 5.0 to facilitate repository administration and server level administration. This utility is a command-line program for Windows 95/98 or Windows NT/2000 to update session-related parameters in a PowerCenter repository. It is a standalone utility that installs in the PowerCenter Client installation directory. It is not currently available for UNIX.

The pmrep utility has two modes: command line and interactive mode.

• Command line mode lets you execute pmrep commands from the Windows command line. This mode invokes and exits each time a command is issued. Command line mode is useful for batch files or scripts.
• Interactive mode invokes pmrep and allows you to issue a series of commands from a pmrep prompt without exiting after each command.

The following examples illustrate the use of pmrep:

Example 1: Script to backup PowerCenter Repository

echo Connecting to repository <Informatica Repository Name>...
d:\PROGRA~1\INFORM~1\pmrep\pmrep connect -r <Informatica Repository Name> -n <Repository User Name> -x <Repository Password> -t <Database Type> -u <Database User Name> -p <Database Password> -c <Database Connection String>
echo Starting Repository Backup...
d:\PROGRA~1\INFORM~1\pmrep\pmrep backup -o <Output File Name>
echo Clearing Connection...
d:\PROGRA~1\INFORM~1\pmrep\pmrep cleanup
echo Repository Backup is Complete...
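The same three-step sequence from Example 1 (connect, backup, cleanup) could also be assembled by a small wrapper before being handed to a scheduler or batch runner. The following is a minimal sketch in Python, assuming only the pmrep options that appear in the example script; the function name and all sample values are hypothetical, and the sketch only builds the command lines, it does not run pmrep.

```python
from typing import List

# Path to pmrep as written in the example script; adjust for your installation.
PMREP = r"d:\PROGRA~1\INFORM~1\pmrep\pmrep"


def build_backup_commands(repo: str, repo_user: str, repo_password: str,
                          db_type: str, db_user: str, db_password: str,
                          connect_string: str,
                          output_file: str) -> List[List[str]]:
    """Return the connect, backup, and cleanup commands as argument lists."""
    connect = [PMREP, "connect",
               "-r", repo, "-n", repo_user, "-x", repo_password,
               "-t", db_type, "-u", db_user, "-p", db_password,
               "-c", connect_string]
    backup = [PMREP, "backup", "-o", output_file]
    cleanup = [PMREP, "cleanup"]
    # Order matters: a connection is opened, used, then cleared.
    return [connect, backup, cleanup]


if __name__ == "__main__":
    # Every value below is a hypothetical placeholder, not a real credential.
    commands = build_backup_commands(
        "DEV_REPO", "admin", "secret", "Oracle",
        "repo_owner", "db_secret", "DEVDB", r"d:\backup\dev_repo.rep")
    for command in commands:
        print(" ".join(command))
```

Keeping each command as an argument list rather than one string avoids quoting problems when a repository name or file path contains spaces.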

Example 2: Script to update database connection information

echo Connecting to repository <Informatica Repository Name>...
d:\PROGRA~1\INFORM~1\pmrep\pmrep connect -r <Informatica Repository Name> -n <Repository User Name> -x <Repository Password> -t <Database Type> -u <Database User Name> -p <Database Password> -c <Database Connection String>
echo Begin Updating Connection Information for <Database Connection Name>...
d:\PROGRA~1\INFORM~1\pmrep\pmrep updatedbconfig -d <Database Connection Name> -u <New Database Username> -p <New Database Password> -c <New Database Connection String> -t <Database Type>
echo Clearing Connection...
d:\PROGRA~1\INFORM~1\pmrep\pmrep cleanup
echo Completed Updating Connection Information for <Database Connection Name>...

Export and Import Registry

The Repository Manager saves repository connection information in the registry. To simplify the process of setting up client machines, you can export the connection information, and then import it to a different client machine (as long as both machines use the same operating system).

The section of the registry that you can import and export contains the following repository connection information:

• Repository name
• Database username and password (must be in US-ASCII)
• Repository username and password (must be in US-ASCII)
• ODBC data source name (DSN)

The registry does not include the ODBC data source. If you import a registry containing a DSN that does not exist on that client system, the connection fails. Be sure to have the appropriate data source configured under the exact same name as the registry you are going to import, for each imported DSN.

High Availability

Challenge

In a highly available environment, load schedules cannot be impacted by the failure of physical hardware. The PowerCenter Server must be running at all times. If the machine hosting the PowerCenter Server goes down, another machine must recognize this and start another Server and assume responsibility for running the sessions and batches. This is best accomplished in a clustered environment.

Description

While there are many types of hardware and many ways to configure a clustered environment, this example is based on the following hardware and software characteristics:

• 2 Sun 4500, running Solaris OS
• Sun High-Availability Clustering Software
• External EMC storage, with each server owning specific disks
• PowerCenter installed on a separate disk that is accessible by both servers in the cluster, but only by one server at a time

One of the Sun 4500’s serves as the primary data integration server, while the other server in the cluster is the secondary server. Under normal operations, the PowerCenter Server ‘thinks’ it is physically hosted by the primary server and uses the resources of the primary server, although it is physically located on its own server.

When the primary server goes down, the Sun high-availability software automatically starts the PowerCenter Server on the secondary server using the basic auto start/stop scripts that are used in many UNIX environments to automatically start the PowerCenter Server whenever a host is rebooted. In addition, the Sun high-availability software changes the ownership of the disk where the PowerCenter Server is installed from the primary server to the secondary server. To facilitate this, a logical IP address can be created specifically for the PowerCenter Server. This logical IP address is specified in the pmserver.cfg file instead of the physical IP addresses of the servers. Thus, only one pmserver.cfg file is needed.


Recommended Performance Tuning Procedures

Challenge

Efficient and effective performance tuning for PowerCenter products.

Description

Performance tuning procedures consist of the following steps in a pre-determined order to pinpoint where tuning efforts should be focused.

1. Perform Benchmarking. Benchmark the sessions to set a baseline to measure improvements against.
2. Monitor the server. By running a session and monitoring the server, it should immediately be apparent if the system is paging memory or if the CPU load is too high for the number of available processors. If the system is paging, correcting the system to prevent paging (e.g., increasing the physical memory available on the machine) can greatly improve performance.
3. Use the performance details. Re-run the session and monitor the performance details. This time look at the details and watch for the Buffer Input and Outputs for the sources and targets.
4. Tune the source system and target system based on the performance details. When the source and target are optimized, re-run the session to determine the impact of the changes.
5. Only after the server, source, and target have been tuned to their peak performance should the mapping be analyzed for tuning.
6. After the tuning achieves a desired level of performance, the DTM should be the slowest portion of the session details. This indicates that the source data is arriving quickly, the target is inserting the data quickly, and the actual application of the business rules is the slowest portion. This is the optimum desired performance. Only minor tuning of the session can be conducted at this point and usually has only a minor effect.
7. Finally, re-run the sessions that have been identified as the benchmark, comparing the new performance with the old performance. In some cases, optimizing one or two sessions to run quickly can have a disastrous effect on another mapping and care should be taken to ensure that this does not occur.
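The final comparison against the benchmark can be sketched as a small check that flags any session that became slower while others were being tuned. This is a minimal illustration in Python, assuming elapsed times in seconds have already been collected for each benchmark session; the function name, the 5% tolerance, and the session names are all hypothetical.

```python
from typing import Dict, List, Tuple


def compare_to_baseline(baseline: Dict[str, float],
                        current: Dict[str, float],
                        tolerance: float = 0.05) -> List[Tuple[str, float]]:
    """Return (session, fractional change) for sessions slower than baseline.

    Values are elapsed seconds per session. A change of 0.25 means the
    session now takes 25% longer than it did in its benchmark run.
    """
    regressions = []
    for session, base_seconds in baseline.items():
        if session not in current:
            continue  # session was not re-run, so there is nothing to compare
        change = (current[session] - base_seconds) / base_seconds
        if change > tolerance:
            regressions.append((session, round(change, 4)))
    return regressions


if __name__ == "__main__":
    # Hypothetical timings: one session improved, another regressed.
    baseline = {"s_load_customer_dim": 600.0, "s_load_sales_fact": 1400.0}
    current = {"s_load_customer_dim": 300.0, "s_load_sales_fact": 1750.0}
    for session, change in compare_to_baseline(baseline, current):
        print(f"{session} is {change:.0%} slower than its benchmark")
```

Reporting only the regressions keeps attention on the case the last step warns about: a session that suffered because another one was optimized.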

Performance Tuning Databases

Challenge

Database tuning can result in tremendous improvement in loading performance. This Best Practice covers tips on tuning several databases: Oracle, SQL Server and Teradata.

Oracle

Performance Tuning Tools

Oracle offers many tools for tuning an Oracle instance. Most DBAs are already familiar with these tools, so we’ve included only a short description of some of the major ones here.

• V$ Views
V$ views are dynamic performance views that provide real-time information on database activity, enabling the DBA to draw conclusions about database performance. Because SYS is the owner of these views, only SYS can query them. Keep in mind that querying these views impacts database performance, with each query having an immediate hit. With this in mind, carefully consider which users should be granted the privilege to query these views. You can grant viewing privileges with either the ‘SELECT’ privilege, which allows a user to view individual V$ views, or the ‘SELECT ANY TABLE’ privilege, which allows the user to view all V$ views. Using the SELECT ANY TABLE option requires the ‘O7_DICTIONARY_ACCESSIBILITY’ parameter be set to ‘TRUE’, which allows the ‘ANY’ keyword to apply to SYS owned objects.

• Explain Plan
Explain Plan, SQL Trace, and TKPROF are powerful tools for revealing bottlenecks and developing a strategy to avoid them. Explain Plan allows the DBA or developer to determine the execution path of a block of SQL code. The SQL in a source qualifier or in a lookup that is running for a long time should be generated and copied to SQL*PLUS or other SQL tool and tested to avoid inefficient execution of these statements. Review the PowerCenter session log for long initialization time (an indicator that the source qualifier may need tuning) and the time it takes to build a lookup cache to determine if the SQL for these transformations should be tested.

• SQL Trace
SQL Trace extends the functionality of Explain Plan by providing statistical information about the SQL statements executed in a session that has tracing enabled. This utility is run for a session with the ‘ALTER SESSION SET SQL_TRACE = TRUE’ statement.

• TKPROF
The output of SQL Trace is provided in a dump file that is difficult to read. TKPROF formats this dump file into a more understandable report.

• UTLBSTAT & UTLESTAT
Executing ‘UTLBSTAT’ creates tables to store dynamic performance statistics and begins the statistics collection process. Run this utility after the database has been up and running (for hours or days). Accumulating statistics may take time, so you need to run this utility for a long while and through several operations (i.e., both loading and querying). ‘UTLESTAT’ ends the statistics collection process and generates an output file called ‘report.txt.’ This report should give the DBA a fairly complete idea about the level of usage the database experiences and reveal areas that should be addressed.

Disk I/O

Disk I/O at the database level provides the highest level of performance gain in most systems. Database files should be separated and identified. Rollback files should be separated onto their own disks because they have significant disk I/O. Co-locate tables that are heavily used with tables that are rarely used to help minimize disk contention. Separate indexes so that when queries run indexes and tables, they are not fighting for the same resource. Also be sure to implement disk striping; this, or RAID technology, can help immensely in reducing disk contention. While this type of planning is time consuming, the payoff is well worth the effort in terms of performance gains.

Memory and Processing

Memory and processing configuration is done in the init.ora file. Because each database is different and requires an experienced DBA to analyze and tune it for optimal performance, a standard set of parameters to optimize PowerCenter is not practical and will probably never exist.

TIP: Changes made in the init.ora file will take effect after a restart of the instance. Use svrmgr to issue the commands “shutdown” and “startup” (eventually “shutdown immediate”) to the instance.

The settings presented here are those used in a 4-CPU AIX server running Oracle 7.3.4 set to make use of the parallel query option to facilitate parallel processing of queries and indexes. We’ve also included the descriptions and documentation from Oracle for each setting to help DBAs of other (non-Oracle) systems to determine what the commands do in the Oracle environment to facilitate setting their native database commands and settings in a similar fashion.

• HASH_AREA_SIZE = 16777216
o Default value: 2 times the value of SORT_AREA_SIZE
o Range of values: any integer
o This parameter specifies the maximum amount of memory, in bytes, to be used for the hash join. If this parameter is not set, its value defaults to twice the value of the SORT_AREA_SIZE parameter.
o The value of this parameter can be changed without shutting down the Oracle instance by using the ALTER SESSION command. (Note: ALTER SESSION refers to the Database Administration command issued at the svrmgr command prompt.)

• Optimizer_percent_parallel=33
This parameter defines the amount of parallelism that the optimizer uses in its cost functions. The default of 0 means that the optimizer chooses the best serial plan. A value of 100 means that the optimizer uses each object's degree of parallelism in computing the cost of a full table scan operation. The value of this parameter can be changed without shutting down the Oracle instance by using the ALTER SESSION command. Low values favor indexes, while high values favor table scans. Cost-based optimization is always used for queries that reference an object with a nonzero degree of parallelism. For such queries, a RULE hint or optimizer mode or goal is ignored. Use of a FIRST_ROWS hint or optimizer mode overrides a nonzero setting of OPTIMIZER_PERCENT_PARALLEL.

• parallel_max_servers=40
o Used to enable parallel query.
o Initially not set on Install.
o Maximum number of query servers or parallel recovery processes for an instance.

• Parallel_min_servers=8
o Used to enable parallel query.
o Initially not set on Install.
o Minimum number of query server processes for an instance. This is also the number of query server processes Oracle creates when the instance is started.

• SORT_AREA_SIZE=8388608
o Default value: Operating system-dependent
o Minimum value: the value equivalent to two database blocks
o This parameter specifies the maximum amount, in bytes, of Program Global Area (PGA) memory to use for a sort. After the sort is complete and all that remains to do is to fetch the rows out, the memory is released down to the size specified by SORT_AREA_RETAINED_SIZE. After the last row is fetched out, all memory is freed. The memory is released back to the PGA, not to the operating system.
o Increasing SORT_AREA_SIZE size improves the efficiency of large sorts. Multiple allocations never exist; there is only one memory area of SORT_AREA_SIZE for each user process at any time.
o The default is usually adequate for most database operations. However, if very large indexes are created, this parameter may need to be adjusted. For example, if one process is doing all database access, as in a full database import, then an increased value for this parameter may speed the import, particularly the CREATE INDEX statements.

IPC as an Alternative to TCP/IP on UNIX

On an HP/UX server with Oracle as a target (i.e., PMServer and Oracle target on same box), using an IPC connection can significantly reduce the time it takes to build a lookup cache. In one case, a fact mapping that was using a lookup to get five columns (including a foreign key) and about 500,000 rows from a table was taking 19 minutes. Changing the connection type to IPC reduced this to 45 seconds. In another mapping, the total time decreased from 24 minutes to 8 minutes for ~120-130 bytes/row, 500,000 row write (array inserts), primary key with unique index in place. Performance went from about 2Mb/min (280 rows/sec) to about 10Mb/min (1360 rows/sec).

A normal tcp (network tcp/ip) connection in tnsnames.ora would look like this:

DW.armafix =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS =
        (PROTOCOL = TCP)
        (HOST = armafix)
        (PORT = 1526)
      )
    )
    (CONNECT_DATA = (SID = DW))
  )

Make a new entry in the tnsnames like this, and use it for connection to the local Oracle instance:

DWIPC.armafix =
  (DESCRIPTION =
    (ADDRESS =
      (PROTOCOL = ipc)
      (KEY = DW)
    )
    (CONNECT_DATA = (SID = DW))
  )

Improving Data Load Performance

• Alternative to Dropping and Reloading Indexes

Dropping and reloading indexes during very large loads to a data warehouse is often recommended but there is seldom any easy way to do this. For example, writing a SQL statement to drop each index, then writing another SQL statement to rebuild it can be a very tedious process.

Oracle 7 (and above) offers an alternative to dropping and rebuilding indexes by allowing you to disable and re-enable existing indexes. Oracle stores the name of each index in a table that can be queried. With this in mind, it is an easy matter to write a SQL statement that queries this table, then generate SQL statements as output to disable and enable these indexes.

Run the following to generate output to disable the foreign keys in the data warehouse:

SELECT 'ALTER TABLE ' || OWNER || '.' || TABLE_NAME || ' DISABLE CONSTRAINT ' || CONSTRAINT_NAME || ' ;'
FROM USER_CONSTRAINTS
WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')
AND CONSTRAINT_TYPE = 'R'

This produces output that looks like:

ALTER TABLE MDDB_DEV.AGREEMENT_DIM DISABLE CONSTRAINT SYS_C0011077 ;
ALTER TABLE MDDB_DEV.AGREEMENT_DIM DISABLE CONSTRAINT SYS_C0011075 ;
ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE CONSTRAINT SYS_C0011060 ;
ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE CONSTRAINT SYS_C0011059 ;
ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE CONSTRAINT SYS_C0011133 ;
ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE CONSTRAINT SYS_C0011134 ;
ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE CONSTRAINT SYS_C0011131 ;

Dropping or disabling primary keys will also speed loads. Run the results of this SQL statement after disabling the foreign key constraints:

SELECT 'ALTER TABLE ' || OWNER || '.' || TABLE_NAME || ' DISABLE PRIMARY KEY ;'
FROM USER_CONSTRAINTS
WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')
AND CONSTRAINT_TYPE = 'P'

This produces output that looks like:

ALTER TABLE MDDB_DEV.AGREEMENT_DIM DISABLE PRIMARY KEY ;
ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE PRIMARY KEY ;
ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE PRIMARY KEY ;

Finally, disable any unique constraints with the following:

SELECT 'ALTER TABLE ' || OWNER || '.' || TABLE_NAME || ' DISABLE CONSTRAINT ' || CONSTRAINT_NAME || ' ;'
FROM USER_CONSTRAINTS
WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')
AND CONSTRAINT_TYPE = 'U'

This produces output that looks like:

ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE CONSTRAINT SYS_C0011070 ;
ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE CONSTRAINT SYS_C0011071 ;

Save the results in a single file and name it something like ‘DISABLE.SQL’
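The generated statements above follow a fixed order: foreign keys, then primary keys, then unique constraints. As a rough sketch of the same idea outside SQL*Plus, the Python below builds a DISABLE script in that order from rows shaped like USER_CONSTRAINTS, and builds the matching ENABLE script in the reverse order. It assumes the rows have already been fetched; the function names are hypothetical, and the sample constraint names are taken from the output shown above.

```python
from typing import Iterable, List, Tuple

# (owner, table_name, constraint_name, constraint_type), as in USER_CONSTRAINTS.
# constraint_type is 'R' (foreign key), 'P' (primary key) or 'U' (unique).
Constraint = Tuple[str, str, str, str]

# Order in which the text disables constraints.
DISABLE_ORDER = ["R", "P", "U"]


def _statement(action: str, constraint: Constraint) -> str:
    """Format one ALTER TABLE statement for ENABLE or DISABLE."""
    owner, table, name, ctype = constraint
    if ctype == "P":
        return f"ALTER TABLE {owner}.{table} {action} PRIMARY KEY ;"
    return f"ALTER TABLE {owner}.{table} {action} CONSTRAINT {name} ;"


def disable_script(constraints: Iterable[Constraint]) -> List[str]:
    """Statements that disable foreign keys, then primary keys, then unique."""
    rows = list(constraints)
    return [_statement("DISABLE", row)
            for ctype in DISABLE_ORDER
            for row in rows if row[3] == ctype]


def enable_script(constraints: Iterable[Constraint]) -> List[str]:
    """Statements that re-enable constraints in the reverse order."""
    rows = list(constraints)
    return [_statement("ENABLE", row)
            for ctype in reversed(DISABLE_ORDER)
            for row in rows if row[3] == ctype]


if __name__ == "__main__":
    sample = [
        ("MDDB_DEV", "CUSTOMER_DIM", "SYS_C0011070", "U"),
        ("MDDB_DEV", "CUSTOMER_DIM", "SYS_C0011060", "R"),
    ]
    print("\n".join(disable_script(sample)))
    print("\n".join(enable_script(sample)))
```

Deriving both scripts from one list of constraints keeps DISABLE.SQL and its counterpart in step, so a constraint cannot be disabled by one script and forgotten by the other.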

To re-enable the indexes, rerun these queries after replacing ‘DISABLE’ with ‘ENABLE.’ Save the results in another file with a name such as ‘ENABLE.SQL’ and run it as a post-session command. Re-enable constraints in the reverse order that you disabled them. Re-enable the unique constraints first, and re-enable primary keys before foreign keys.

TIP: Dropping or disabling foreign keys will often boost loading, but this also slows queries (such as lookups) and updates. If you do not use lookups or updates on your target tables you should get a boost by using this SQL statement to generate scripts. If you use lookups and updates (especially on large tables), you can exclude the index that will be used for the lookup from your script. You may want to experiment to determine which method is faster.

SQL*Loader

• Loader Options

SQL*Loader is a bulk loader utility used for moving data from external files into the Oracle database. To use the Oracle bulk loader, you need a control file, which specifies how data should be loaded into the database. SQL*Loader has several options that can improve data loading performance and are easy to implement. These options are:

• DIRECT
• PARALLEL
• SKIP_INDEX_MAINTENANCE
• UNRECOVERABLE

A control file normally has the following format:

LOAD DATA
INFILE <dataFile>
APPEND INTO TABLE <tableName>
FIELDS TERMINATED BY '<separator>'
(<list of all attribute names to load>)

To use any of these options, merely add ‘OPTIONS (OPTION = TRUE)’ to the beginning of the control file, such as ‘OPTIONS (DIRECT = TRUE)’.

The CONVENTIONAL path is the default method for SQL*Loader. This performs like a typical INSERT statement that updates indexes, fires triggers, and evaluates constraints.

The DIRECT path obtains an exclusive lock on the table being loaded and writes the data blocks directly to the database files, bypassing all SQL processing. Note that no other users can write to the loading table due to this exclusive lock and no SQL transformations can be made in the control file during the load.

The PARALLEL option can be used with the DIRECT option when loading multiple partitions of the same table. If the partitions are located on separate disks, the performance time can be reduced to that of loading a single partition.

If the CONVENTIONAL path must be used (i.e., transformations are performed during the load, for example), then you can bypass index updates by using the SKIP_INDEX_MAINTENANCE option. You will have to rebuild the indexes after the load, but overall performance may improve significantly.

The DIRECT option automatically disables CHECK and foreign key REFERENCES constraints, but not PRIMARY KEY, UNIQUE KEY and NOT NULL constraints. Disabling these constraints with the SQL scripts described earlier will benefit performance when loading data into a target warehouse.

The UNRECOVERABLE option in the control file allows you to skip redo log writes during a DIRECT path load. Recoverability should not be an issue since the data file still exists.

Loading Partitioned Sessions

To improve performance when loading data to an Oracle database using a partitioned session, create the Oracle target table with the same number of partitions as the session.

Optimizing Query Performance

• Oracle Bitmap Indexing

With version 7.3.x, Oracle added bitmap indexing to supplement the traditional b-tree index. A b-tree index can greatly improve query performance on data that has high cardinality or contains mostly unique values, but is not much help for low cardinality/highly duplicated data and may even increase query time. A typical example of a low cardinality field is gender: it is either male or female (or possibly unknown). This kind of data is an excellent candidate for a bitmap index, and can significantly improve query performance.

Keep in mind, however, that b-tree indexing is still the Oracle default. If you don’t specify an index type when creating an index, Oracle will default to b-tree. Also note that for certain columns, bitmaps will be smaller and faster to create than a b-tree index on the same column.

Bitmap indexes are suited to data warehousing because of their performance, size, and ability to create and drop very quickly. Since most dimension tables in a warehouse have nearly every column indexed, the space savings is dramatic. But it is important to note that when a bitmap-indexed column is updated, every row associated with that bitmap entry is locked, making bitmap indexing a poor choice for OLTP database tables with constant insert and update traffic. Also, bitmap indexes are rebuilt after each DML statement (e.g., inserts and updates), which can make loads very slow. For this reason, it is a good idea to drop or disable bitmap indexes prior to the load and recreate or re-enable them after the load.

The relationship between Fact and Dimension keys is another example of low cardinality. With a b-tree index on the Fact table, a query processes by joining all the Dimension tables in a Cartesian product based on the WHERE clause, then joins back to the Fact table. With a bitmapped index on the Fact table, a ‘star query’ may be created that accesses the Fact table first followed by the Dimension table joins, avoiding a Cartesian product of all possible Dimension attributes. This ‘star query’ access method is only used if the STAR_TRANSFORMATION_ENABLED parameter is equal to TRUE in the init.ora file and if there are single column bitmapped indexes on the fact table foreign keys.

Creating bitmap indexes is similar to creating b-tree indexes. To specify a bitmap index, add the word ‘bitmap’ between ‘create’ and ‘index’. All other syntax is identical.

• Bitmap indexes:
drop index emp_active_bit;
drop index emp_gender_bit;
create bitmap index emp_active_bit on emp (active_flag);
create bitmap index emp_gender_bit on emp (gender);

• B-tree indexes:
drop index emp_active;
drop index emp_gender;
create index emp_active on emp (active_flag);
create index emp_gender on emp (gender);

Information for bitmap indexes is stored in the data dictionary in dba_indexes, all_indexes, and user_indexes with the word ‘BITMAP’ in the Uniqueness column rather than the word ‘UNIQUE.’ Bitmap indexes cannot be unique.

To enable bitmap indexes, you must set the following items in the instance initialization file:

• compatible = 7.3.2.0.0 # or higher
• event = "10111 trace name context forever"
• event = "10112 trace name context forever"

• event = "10114 trace name context forever"

Also note that the parallel query option must be installed in order to create bitmap indexes. If you try to create bitmap indexes without the parallel query option, a syntax error will appear in your SQL statement; the keyword ‘bitmap’ won't be recognized.

TIP: To check if the parallel query option is installed, start and log into SQL*Plus. If the parallel query option is installed, the word ‘parallel’ appears in the banner text.

Index Statistics

• Table Method

Index statistics are used by Oracle to determine the best method to access tables and should be updated periodically as normal DBA procedures. The following will improve query results on Fact and Dimension tables (including appending and updating records) by updating the table and index statistics for the data warehouse:

The following SQL statement can be used to analyze the tables in the database:

SELECT 'ANALYZE TABLE ' || TABLE_NAME || ' COMPUTE STATISTICS;'
FROM USER_TABLES
WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')

This generates the following results:

ANALYZE TABLE CUSTOMER_DIM COMPUTE STATISTICS;
ANALYZE TABLE MARKET_DIM COMPUTE STATISTICS;
ANALYZE TABLE VENDOR_DIM COMPUTE STATISTICS;

The following SQL statement can be used to analyze the indexes in the database:

SELECT 'ANALYZE INDEX ' || INDEX_NAME || ' COMPUTE STATISTICS;'
FROM USER_INDEXES
WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')

This generates the following results:

ANALYZE INDEX SYS_C0011125 COMPUTE STATISTICS;
ANALYZE INDEX SYS_C0011119 COMPUTE STATISTICS;
ANALYZE INDEX SYS_C0011105 COMPUTE STATISTICS;

Save these results as a SQL script to be executed before or after a load.

• Schema Method

Another way to update index statistics is to compute indexes by schema rather than by table. If data warehouse indexes are the only indexes located in a single schema, then you can use the following command to update the statistics:

EXECUTE SYS.DBMS_UTILITY.Analyze_Schema ('BDB', 'compute');

In this example, BDB is the schema for which the statistics should be updated. Note that the DBA must grant the execution privilege for dbms_utility to the database user executing this command.

TIP: These SQL statements can be very resource intensive, especially for very large tables. For this reason, we recommend running them at off-peak times when no other process is using the database. If you find the exact computation of the statistics consumes too much time, it is often acceptable to estimate the statistics rather than compute them. Use ‘estimate’ instead of ‘compute’ in the above examples.

Parallelism

Parallel execution can be implemented at the SQL statement, database object, or instance level for many SQL operations. The degree of parallelism should be identified based on the number of processors and disk drives on the server, with the number of processors being the minimum degree.

• SQL Level Parallelism

Hints are used to define parallelism at the SQL statement level. The following examples demonstrate how to utilize four processors:

SELECT /*+ PARALLEL(order_fact,4) */ ….
SELECT /*+ PARALLEL_INDEX(order_fact, order_fact_ixl,4) */ ….

TIP: When using a table alias in the SQL Statement, be sure to use this alias in the hint. Otherwise, the hint will not be used, and you will not receive an error message.

Example of improper use of alias:

SELECT /*+PARALLEL (EMP, 4) */ EMPNO, ENAME FROM EMP A

Here, the parallel hint will not be used because of the used alias “A” for table EMP. The correct way is:

SELECT /*+PARALLEL (A, 4) */ EMPNO, ENAME FROM EMP A

• Table Level Parallelism

Parallelism can also be defined at the table and index level. The following example demonstrates how to set a table’s degree of parallelism to four for all eligible SQL statements on this table:

ALTER TABLE order_fact PARALLEL 4;

Ensure that Oracle is not contending with other processes for these resources or you may end up with degraded performance due to resource contention.

Additional Tips

• Executing Oracle SQL Scripts as Pre and Post Session Commands on UNIX

You can execute queries as both pre- and post-session commands. For a UNIX environment, the format of the command is:

sqlplus -s user_id/password@database @ script_name.sql

For example, to execute the ENABLE.SQL file created earlier (assuming the data warehouse is on a database named ‘infadb’), you would execute the following as a post-session command:

sqlplus -s pmuser/pmuser@infadb @ /informatica/powercenter/Scripts/ENABLE.SQL

In some environments, this may be a security issue since both username and password are hard-coded and unencrypted. To avoid this, use the operating system’s authentication to log onto the database instance.

In the following example, the Informatica id “pmuser” is used to log onto the Oracle database. Create the Oracle user “pmuser” with the following SQL statement:

CREATE USER PMUSER IDENTIFIED EXTERNALLY
DEFAULT TABLESPACE . . .
TEMPORARY TABLESPACE . . .

In the following pre-session command, “pmuser” (the id Informatica is logged onto the operating system as) is automatically passed from the operating system to the database and used to execute the script:

sqlplus -s /@infadb @/informatica/powercenter/Scripts/ENABLE.SQL

You may want to use the init.ora parameter “os_authent_prefix” to distinguish between “normal” oracle-users and “external-identified” ones.

• DRIVING_SITE ‘Hint’

If the source and target are on separate instances, the Source Qualifier transformation should be executed on the target instance. For example, you want to join two source tables (A and B) together, which may reduce the number of selected rows. However, Oracle fetches all of the data from both tables, moves the data across the network to the target instance, then processes everything on the target instance. If either data source is large, this causes a great deal of network traffic. To force the Oracle optimizer to process the join on the source instance, use the ‘Generate SQL’ option in the source qualifier and include the ‘driving_site’ hint in the SQL statement as:

SELECT /*+ DRIVING_SITE */ ….

SQL Server

Description

Proper tuning of the source and target database is a very important consideration to the scalability and usability of a business analytical environment. Managing performance on an SQL Server encompasses the following points.

• Manage system memory usage (RAM caching)
• Create and maintain good indexes
• Partition large data sets and indexes
• Monitor disk I/O subsystem performance
• Tune applications and queries
• Optimize active data

Manage RAM Caching

Managing random access memory (RAM) buffer cache is a major consideration in any database server environment. Accessing data in RAM cache is much faster than accessing the same information from disk. If database I/O (input/output operations to the physical disk subsystem) can be reduced to the minimal required set of data and index pages, these pages will stay in RAM longer. Too much unneeded data and index information flowing into buffer cache quickly pushes out valuable pages. The primary goal of performance tuning is to reduce I/O so that buffer cache is best utilized.

Several settings in SQL Server can be adjusted to take advantage of SQL Server RAM usage:

• Max async I/O is used to specify the number of simultaneous disk I/O operations that SQL Server can submit to the operating system. Note that this setting is automated in SQL Server 2000.
• SQL Server allows several selectable models for database recovery, these include:
  Full Recovery
  Bulk-Logged Recovery
  Simple Recovery

Cost Threshold for Parallelism Option

Use this option to specify the threshold where SQL Server creates and executes parallel plans. SQL Server creates and executes a parallel plan for a query only when the estimated cost to execute a serial plan for the same query is higher than the value set in cost threshold for parallelism. The cost refers to an estimated elapsed time in seconds required to execute the serial plan on a specific hardware configuration. Only set cost threshold for parallelism on symmetric multiprocessors (SMP).

Max Degree of Parallelism Option

Use this option to limit the number of processors (a max of 32) to use in parallel plan execution. The default value is 0, which uses the actual number of available CPUs. Set this option to 1 to suppress parallel plan generation. Set the value to a number greater than 1 to restrict the maximum number of processors used by a single query execution.

Priority Boost Option

Use this option to specify whether SQL Server should run at a higher scheduling priority than other processes on the same computer. If you set this option to one, SQL Server runs at a priority base of 13. The default is 0, which is a priority base of seven.

Set Working Set Size Option

Use this option to reserve physical memory space for SQL Server that is equal to the server memory setting. The server memory setting is configured automatically by SQL Server based on workload and available resources. It will vary dynamically between min server memory and max server memory. Setting ‘set working set’ size means the operating system will not attempt to swap out SQL Server pages even if they can be used more readily by another process when SQL Server is idle.

Optimizing Disk I/O Performance

When configuring a SQL Server that will contain only a few gigabytes of data and not sustain heavy read or write activity, you need not be particularly concerned with the subject of disk I/O and balancing of SQL Server I/O activity across hard drives for maximum performance. To build larger SQL Server databases, however, which will contain hundreds of gigabytes or even terabytes of data and/or that can sustain heavy read/write activity (as in a DSS application), it is necessary to drive configuration around maximizing SQL Server disk I/O performance by load-balancing across multiple hard drives.

Partitioning for Performance

For SQL Server databases that are stored on multiple disk drives, performance can be improved by partitioning the data to increase the amount of disk I/O parallelism.

Partitioning can be done using a variety of techniques. Methods for creating and managing partitions include configuring your storage subsystem (i.e., disk, RAID partitioning) and applying various data configuration mechanisms in SQL Server such as files, file groups, tables and views. Some possible candidates for partitioning include:

• Transaction log
• Tempdb
• Database
• Tables
• Non-clustered indexes

Using bcp and BULK INSERT

Two mechanisms exist inside SQL Server to address the need for bulk movement of data. The first mechanism is the bcp utility. The second is the BULK INSERT statement.

• Bcp is a command prompt utility that copies data into or out of SQL Server.
• BULK INSERT is a Transact-SQL statement that can be executed from within the database environment. Unlike bcp, BULK INSERT can only pull data into SQL Server. An advantage of using BULK INSERT is that it can copy data into instances of SQL Server using a Transact-SQL statement, rather than having to shell out to the command prompt.

TIP: Both of these mechanisms enable you to exercise control over the batch size. Unless you are working with small volumes of data, it is good to get in the habit of specifying a batch size for recoverability reasons. If none is specified, SQL Server commits all rows to be loaded as a single batch. For example, suppose you attempt to load 1,000,000 rows of new data into a table. The server suddenly loses power just as it finishes processing row number 999,999. When the server recovers, those 999,999 rows will need to be rolled back out of the database before you attempt to reload the data. By specifying a batch size of 10,000 you could have saved significant recovery time, because SQL Server would only have had to roll back 9,999 rows instead of 999,999.
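The arithmetic in the TIP above can be checked with a small sketch. This is an illustration only, not part of bcp or BULK INSERT: the function name rows_to_roll_back is hypothetical, and it assumes that every completed batch is committed and that only the rows of the incomplete batch are rolled back after a failure.

```python
from typing import Optional


def rows_to_roll_back(rows_processed: int, batch_size: Optional[int] = None) -> int:
    """Estimate how many rows are rolled back if a bulk load fails.

    Assumption (hypothetical model): each full batch is committed as soon
    as it completes, so only rows in the unfinished batch are lost.
    With no batch size, the whole load is treated as a single batch.
    """
    if rows_processed < 0:
        raise ValueError("rows_processed must be non-negative")
    if not batch_size:
        # No batch size specified: nothing has been committed yet.
        return rows_processed
    # Rows beyond the last completed batch boundary are uncommitted.
    return rows_processed % batch_size


# The example from the text: failure after row 999,999 of 1,000,000.
print(rows_to_roll_back(999_999))          # single batch
print(rows_to_roll_back(999_999, 10_000))  # batch size of 10,000
```

With the figures from the text, the first call gives 999,999 rows and the second gives 9,999 rows, which is the saving the TIP describes.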

General Guidelines for Initial Data Loads

While loading data:

• Remove indexes
• Use BULK INSERT or bcp
• Parallel load using partitioned data files into partitioned tables
• Run one load stream for each available CPU
• Set Bulk-Logged or Simple Recovery model
• Use TABLOCK option

After loading data:

• Create indexes
• Switch to the appropriate recovery model
• Perform backups

General Guidelines for Incremental Data Loads

• Load data with indexes in place
• Performance and concurrency requirements should determine locking granularity (sp_indexoption).
• Change from Full to Bulk-Logged Recovery mode unless there is an overriding need to preserve a point-in-time recovery, such as online users modifying the database during bulk loads. Read operations should not affect bulk loads.

Teradata

Description

Teradata offers several bulk load utilities including FastLoad, MultiLoad, and TPump. FastLoad is used for loading inserts into an empty table. One of TPump's advantages is that it does not lock the table that is being loaded. MultiLoad supports inserts, updates, deletes, and "upserts" to any table. This best practice will focus on MultiLoad since PowerCenter 5.x can auto-generate MultiLoad scripts and invoke the MultiLoad utility per PowerCenter target.

Tuning MultiLoad

There are many aspects to tuning a Teradata database. With PowerCenter 5.x several aspects of tuning can be controlled by setting MultiLoad parameters to maximize write throughput. Other areas to analyze when performing a MultiLoad job include estimating space requirements and monitoring MultiLoad performance.

Note: In PowerCenter 5.1, the Informatica server transfers data via a UNIX named pipe to MultiLoad, whereas in PowerCenter 5.0, the data is first written to file.

MultiLoad Parameters

With PowerCenter 5.x, you can auto-generate MultiLoad scripts. This not only enhances development, but also allows you to set performance options. Here are the MultiLoad-specific parameters that are available in PowerCenter:

• TDPID. A client based operand that is part of the logon string.
• Date Format. Ensure that the date format used in your target flat file is equivalent to the date format parameter in your MultiLoad script. Also validate that your date format is compatible with the date format specified in the Teradata database.
• Checkpoint. A checkpoint interval is similar to a commit interval for other databases. When you set the checkpoint value to less than 60, it represents the interval in minutes between checkpoint operations. If the checkpoint is set to a value greater than 60, it represents the number of records to write before performing a checkpoint operation. To maximize write speed to the database, try to limit the number of checkpoint operations that are performed.
• Tenacity. Interval in hours between MultiLoad attempts to log on to the database when the maximum number of sessions are already running.
• Load Mode. Available load methods include Insert, Update, Delete, and Upsert. Consider creating separate external loader connections for each method, selecting the one that will be most efficient for each target table.
• Drop Error Tables. Allows you to specify whether to drop or retain the three error tables for a MultiLoad session. Set this parameter to 1 to drop error tables or 0 to retain error tables.
• Max Sessions. Available only in PowerCenter 5.1, this parameter specifies the maximum number of sessions that are allowed to log on to the database. This value should not exceed one per working amp (Access Module Process).
• Sleep. Available only in PowerCenter 5.1, this parameter specifies the number of minutes that MultiLoad waits before retrying a logon operation.

Estimating Space Requirements for MultiLoad Jobs

Always estimate the final size of your MultiLoad target tables and make sure the destination has enough space to complete your MultiLoad job. In addition to the space that may be required by target tables, each MultiLoad job needs permanent space for:

• Work tables
• Error tables
• Restart Log table

Note: Spool space cannot be used for MultiLoad work tables, error tables, or the restart log table. Spool space is freed at each restart. By using permanent space for the MultiLoad tables, data is preserved for restart operations after a system failure. Work tables, in particular, require a lot of extra permanent space. Also remember to account for the size of error tables since error tables are generated for each target table.

Use the following formula to prepare the preliminary space estimate for one target table, assuming no fallback protection, no journals, and no non-unique secondary indexes:

PERM = (using data size + 38) x (number of rows processed) x (number of apply conditions satisfied) x (number of Teradata SQL statements within the applied DML)

Make adjustments to your preliminary space estimates according to the requirements and expectations of your MultiLoad job.

Monitoring MultiLoad Performance

Here are some tips for analyzing MultiLoad performance:

1. Determine which phase of the MultiLoad job is causing poor performance.
   • If the performance bottleneck is during the acquisition phase, as data is acquired from the client system, then the issue may be with the client system. If it is during the application phase, as data is applied to the target tables, then the issue is not likely to be with the client system.
   • The MultiLoad job output lists the job phases and other useful information. Save these listings for evaluation.
2. Use the Teradata RDBMS Query Session utility to monitor the progress of the MultiLoad job.
3. Check for locks on the MultiLoad target tables and error tables.
4. Check the DBC.Resusage table for problem areas, such as data bus or CPU capacities at or near 100 percent for one or more processors.
5. Determine whether the target tables have non-unique secondary indexes (NUSIs). NUSIs degrade MultiLoad performance because the utility builds a separate NUSI change row to be applied to each NUSI sub-table after all of the rows have been applied to the primary table.
6. Check the size of the error tables. Write operations to the fallback error tables are performed at normal SQL speed, which is much slower than normal MultiLoad tasks.
7. Verify that the primary index is unique. Non-unique primary indexes can cause severe MultiLoad performance problems.
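The preliminary PERM space formula given earlier in this section can be sketched as a small helper for comparing load scenarios. The function name and the sample figures below are hypothetical, and the sketch assumes that "using data size" is expressed in bytes per row; as in the text, it ignores fallback protection, journals, and non-unique secondary indexes.

```python
def multiload_perm_estimate(
    using_data_size: int,
    rows_processed: int,
    apply_conditions_satisfied: int = 1,
    sql_statements_in_dml: int = 1,
) -> int:
    """Preliminary permanent-space estimate for one MultiLoad target table.

    Implements the formula from the text:
    PERM = (using data size + 38) x rows processed
           x apply conditions satisfied x SQL statements in the applied DML.
    The unit of using_data_size (bytes) is an assumption of this sketch.
    """
    return (
        (using_data_size + 38)
        * rows_processed
        * apply_conditions_satisfied
        * sql_statements_in_dml
    )


# Hypothetical job: 100-byte rows, one million rows, one apply condition,
# one SQL statement in the applied DML.
estimate = multiload_perm_estimate(100, 1_000_000)
print(f"Preliminary PERM estimate: {estimate:,}")
```

Treat the result as a starting point only; the text recommends adjusting the estimate to the requirements of the specific MultiLoad job.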

Performance Tuning UNIX Systems

Challenge

The following tips have proven useful in performance tuning UNIX-based machines. While some of these tips will be more helpful than others in a particular environment, all are worthy of consideration.

Description

Running ps -axu

Run ps -axu to check for the following items:

• Are there any processes waiting for disk access or for paging? If so, check the I/O and memory subsystems.
• What processes are using most of the CPU? This may help you distribute the workload better.
• What processes are using most of the memory? This may help you distribute the workload better.
• Does ps show that your system is running many memory-intensive jobs? Look for jobs with a large resident set size (RSS) or a high storage integral.

Identifying and Resolving Memory Issues

Use vmstat or sar to check swapping actions. Check the system to ensure that swapping does not occur at any time during the session processing. By using sar 5 10 or vmstat 1 10, you can get a snapshot of page swapping. If page swapping does occur at any time, increase memory to prevent swapping. Swapping, on any database system, causes a major performance decrease and increased I/O. On a memory-starved and I/O-bound server, this can effectively shut down the PowerCenter process and any databases running on the server.

Some swapping will normally occur regardless of the tuning settings. This occurs because some processes use the swap space by their design. To check swap space availability, use pstat and swap. If the swap space is too small for the intended applications, it should be increased.

Run vmstat 5 (sar -wpgr) or, for SunOS, vmstat -S 5 to detect and confirm memory problems and check for the following:

• Are page-outs occurring consistently? If so, you are short of memory.
• Are there a high number of address translation faults? (System V only) This suggests a memory shortage.
• Are swap-outs occurring consistently? If so, you are extremely short of memory. Occasional swap-outs are normal; BSD systems swap out inactive jobs. Long bursts of swap-outs mean that active jobs are probably falling victim and indicate extreme memory shortage. If you don't have vmstat -S, look at the w and de fields of vmstat. These should ALWAYS be zero.

If memory seems to be the bottleneck of the system, try the following remedial steps:

• Reduce the size of the buffer cache, if your system has one, by decreasing BUFPAGES. The buffer cache is not used in System V.4 and SunOS 4.X systems. Making the buffer cache smaller will hurt disk I/O performance.
• If you have statically allocated STREAMS buffers, reduce the number of large (2048- and 4096-byte) buffers. This may reduce network performance, but netstat -m should give you an idea of how many buffers you really need.
• Reduce the size of your kernel's tables. This may limit the system's capacity (number of files, number of processes, etc.).
• Try running jobs requiring a lot of memory at night. This may not help the memory problems, but you may not care about them as much.
• Try running jobs requiring a lot of memory in a batch queue. If only one memory-intensive job is running at a time, your system may perform satisfactorily.
• Try to limit the time spent running sendmail, which is a memory hog.
• If you don't see any significant improvement, add more memory.

Identifying and Resolving Disk I/O Issues

Use iostat to check I/O load and utilization, as well as CPU load. Iostat can be used to monitor the I/O load on the disks on the UNIX server. Using iostat permits monitoring the load on specific disks. Take notice of how fairly disk activity is distributed among the system disks. If it is not, are the most active disks also the fastest disks?

Run sadp to get a seek histogram of disk activity. Is activity concentrated in one area of the disk (good), spread evenly across the disk (tolerable), or in two well-defined peaks at opposite ends (bad)?

• Reorganize your file systems and disks to distribute I/O activity as evenly as possible.
• Using symbolic links helps to keep the directory structure the same throughout while still moving the data files that are causing I/O contention.
• Use your fastest disk drive and controller for your root filesystem; this will almost certainly have the heaviest activity. Alternatively, if single-file throughput is important, put performance-critical files into one filesystem and use the fastest drive for that filesystem.
• Put performance-critical files on a filesystem with a large block size: 16KB or 32KB (BSD).

• Increase the size of the buffer cache by increasing BUFPAGES (BSD). This may hurt your system's memory performance.
• Rebuild your file systems periodically to eliminate fragmentation (backup, build a new filesystem, and restore).
• If you are using NFS and using remote files, look at your network situation. You don't have local disk I/O problems.
• Check memory statistics again by running vmstat 5 (sar -rwpg). If your system is paging or swapping consistently, you have memory problems; fix the memory problem first. Swapping makes performance worse.

If your system has a disk capacity problem and is constantly running out of disk space, try the following actions:

• Write a find script that detects old core dumps, editor backup and auto-save files, and other trash and deletes it automatically. Run the script through cron.
• If you are running BSD UNIX or V.4, use the disk quota system to prevent individual users from gathering too much storage.
• Use a smaller block size on file systems that are mostly small files (e.g., source code files, object modules, and small data files).

Identifying and Resolving CPU Overload Issues

Use sar -u to check for CPU loading. This provides the %usr (user), %sys (system), %wio (waiting on I/O), and %idle (% of idle time). A target goal should be %usr + %sys = 80 and %wio = 10, leaving %idle at 10. If %wio is higher, the disk and I/O contention should be investigated to eliminate the I/O bottleneck on the UNIX server. If the system shows a heavy load of %sys, and %usr has a high %idle, this is indicative of memory contention and swapping/paging problems. In this case, it is necessary to make memory changes to reduce the load on the system server.

When you run iostat 5 as described above, also observe the CPU idle time. Is the idle time always 0, without letup? It is good for the CPU to be busy, but if it is always busy 100 percent of the time, work must be piling up somewhere. This points to CPU overload.

• Eliminate unnecessary daemon processes. rwhod and routed are particularly likely to be performance problems, but any savings will help.
• Get users to run jobs at night with at or any queuing system that is available. You may not care if the CPU (or the memory or I/O system) is overloaded at night, provided the work is done in the morning.
• Using nice to lower the priority of CPU-bound jobs will improve interactive performance. Also, using nice to raise the priority of CPU-bound jobs will expedite them but will hurt interactive performance. In general though, using nice is really only a temporary solution. If your workload grows, it will soon become insufficient. Consider upgrading your system, replacing it, or buying another system to share the load.

Identifying and Resolving Network Issues

You can suspect problems with network capacity or with data integrity if users experience slow performance when they are using rlogin or when they are accessing files via NFS.

Look at netstat -i. If the number of collisions is large, suspect an overloaded network. If the number of input or output errors is large, suspect hardware problems. A large number of input errors indicates problems somewhere on the network. A large number of output errors suggests problems with your system and its interface to the network.

If collisions and network hardware are not a problem, figure out which system appears to be slow. Use spray to send a large burst of packets to the slow system. If the number of dropped packets is large, the remote system most likely cannot respond to incoming data fast enough. Look to see if there are CPU, memory or disk I/O problems on the remote system. If not, the system may just not be able to tolerate heavy network workloads. Try to reorganize the network so that this system isn't a file server.

A large number of dropped packets may also indicate data corruption. Run netstat -s on the remote system, then spray the remote system from the local system and run netstat -s again. If the increase of UDP socket full drops (as indicated by netstat) is equal to or greater than the number of dropped packets that spray reports, the remote system is a slow network server. If the increase of socket full drops is less than the number of dropped packets, look for network errors.

Run nfsstat and look at the client RPC data. If the retrans field is more than 5 percent of calls, the network or an NFS server is overloaded. If timeout is high, at least one NFS server is overloaded, the network may be faulty, or one or more servers may have crashed. If badxid is roughly equal to timeout, at least one NFS server is overloaded. If timeout and retrans are high, but badxid is low, some part of the network between the NFS client and server is overloaded and dropping packets.

Try to prevent users from running I/O-intensive programs across the network. The grep utility is a good example of an I/O-intensive program. Instead, have users log into the remote system to do their work.

Reorganize the computers and disks on your network so that as many users as possible can do as much work as possible on a local system.

Use systems with good network performance as file servers.

If you are short of STREAMS data buffers and are running SunOS 4.0 or System V.3 (or earlier), reconfigure the kernel with more buffers.

General Tips and Summary of Other Useful Commands

• Use dirs instead of pwd.
• Avoid ps.
• If you use sh, avoid long search paths.
• Minimize the number of files per directory.
• Use vi or a native window editor rather than emacs.

• Use egrep rather than grep: it's faster.
• Don't run grep or other I/O-intensive applications across NFS.
• Use rlogin rather than NFS to access files on remote systems.

lsattr -E -l sys0 is used to determine some current settings on most UNIX environments. Of particular attention is maxuproc. Maxuproc is the setting to determine the maximum level of user background processes. On most UNIX environments, this is defaulted to 40 but should be increased to 250 on most systems.

Avoid raw devices. In general, proprietary file systems from the UNIX vendor are most efficient and well suited for database work when tuned properly. Be sure to check the database vendor documentation to determine the best file system for the specific machine. Typical choices include: s5, the UNIX System V File System; ufs, the "UNIX File System" derived from Berkeley (BSD); vxfs, the Veritas File System; and lastly raw devices that, in reality, are not a file system at all.

Use the PMProcs utility (a PowerCenter utility) to view the current Informatica processes. For example:

    harmon 125: pmprocs

    <----------- Current PowerMart processes --------------->
    UID       PID   PPID  C   STIME     TTY  TIME  CMD
    powermar  2711  1421  16  18:13:11  ?    0:07  dtm pmserver.cfg 0 202 289406976
    powermar  2713  2711  11  18:13:17  ?    0:05  dtm pmserver.cfg 0 202 289406976
    powermar  1421  1     1   08:39:19  ?    1:30  pmserver
    powermar  2712  2711  17  18:13:17  ?    0:08  dtm pmserver.cfg 0 202 289406976
    powermar  2714  1421  11  18:13:20  ?    0:04  dtm pmserver.cfg 1 202 289406976
    powermar  2721  2714  12  18:13:27  ?    0:04  dtm pmserver.cfg 1 202 289406976
    powermar  2722  2714  8   18:13:27  ?    0:02  dtm pmserver.cfg 1 202 289406976

    <----------- Current Shared Memory Resources --------------->
    IPC status from <running system> as of Tue Feb 16 18:13:55 1999
    T  ID    KEY         MODE         OWNER     GROUP  SEGSZ     CPID  LPID
    Shared Memory:
    m  0     0x094e64a5  --rw-rw----  oracle    dba    20979712  1254  1273
    m  1     0x0927e9b2  --rw-rw----  oradba    dba    21749760  1331  2478
    m  202   00000000    --rw-------  powermar  pm4    5000000   1421  2714
    m  8003  00000000    --rw-------  powermar  pm4    25000000  2711  2711
    m  4     00000000    --rw-------  powermar  pm4    25000000  2714  2714

    <----------- Current Semaphore Resources --------------->

    There are 19 Semaphores held by PowerMart processes

• Pmprocs is a script that combines the ps and ipcs commands
• Only available for UNIX
• CPID - Creator PID
• LPID - Last PID that accessed the resource
• Semaphores - used to sync the reader and writer
• 0 or 1 - shows slot in LM shared memory

Finally, when tuning UNIX environments, the general rule of thumb is to tune the server for a major database system. Because PowerCenter processes data in a similar fashion as SMP databases, by tuning the server to support the database, you also tune the system for PowerCenter.

References: System Performance Tuning (from O'Reilly Publishing) by Mike Loukides is the main reference book for this Best Practice. For detailed information on each of the parameters discussed here, and much more on performance tuning of the applications running on UNIX-based systems, refer to this book.

Most database systems provide a special tuning supplement for each specific version of UNIX. For example, there is a specific IBM Redbook for Oracle 7.3 running on AIX 4.3.
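The sar -u guidance in the CPU overload discussion of this Best Practice can be turned into a simple check. The sketch below is an illustration only: the CpuSample type, the function name, and the idea of averaging %wio over a set of samples are assumptions, and it does not parse real sar output. It applies two of the rules stated in the text: %wio above the target of 10 suggests disk and I/O contention, and an idle time that is always 0 points to CPU overload.

```python
from typing import Iterable, List, NamedTuple


class CpuSample(NamedTuple):
    """One sar -u style reading, in percent (hypothetical container)."""
    usr: float
    sys: float
    wio: float
    idle: float


WIO_TARGET = 10.0  # target %wio stated in the text


def review_cpu_samples(samples: Iterable[CpuSample]) -> List[str]:
    """Return findings based on the %wio and %idle rules of thumb."""
    samples = list(samples)
    findings: List[str] = []
    if not samples:
        return findings
    # Averaging over the samples is a choice of this sketch, not of the text.
    avg_wio = sum(s.wio for s in samples) / len(samples)
    if avg_wio > WIO_TARGET:
        findings.append("%wio above target: investigate disk and I/O contention")
    # "Is the idle time always 0, without letup?"
    if all(s.idle == 0 for s in samples):
        findings.append("%idle always 0: possible CPU overload")
    return findings


busy = [CpuSample(70, 10, 20, 0), CpuSample(75, 5, 20, 0)]
for finding in review_cpu_samples(busy):
    print(finding)
```

The %sys rule from the text is left out deliberately, because the text does not state a threshold for it.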

Performance Tuning Windows NT/2000 Systems

Challenge

The Microsoft Windows NT/2000 environment is easier to tune than UNIX environments, but offers limited performance options. NT is considered a "self-tuning" operating system because it attempts to configure and tune memory to the best of its ability. However, this does not mean that the NT system administrator is entirely free from performance improvement responsibilities.

The following tips have proven useful in performance tuning NT-based machines. While some are likely to be more helpful than others in any particular environment, all are worthy of consideration.

Note: Tuning is essentially the same for both NT and 2000 based systems, with differences for Windows 2000 noted in the last section.

Description

The two places to begin when tuning an NT server are:

• The Performance Monitor.
• The Performance tab (hit ctrl+alt+del, choose task manager, and click on the Performance tab).

When using the Performance Monitor, look for these performance indicators to check:

Processor: percent processor time. For SMP environments you need to add one monitor for each CPU. If the system is "maxed out" (i.e., running at 100 percent for all CPUs), it may be necessary to add processing power to the server. Unfortunately, NT scalability is quite limited, especially in comparison with UNIX environments. Also keep in mind NT's inability to split processes across multiple CPUs. Thus, one CPU may be at 100% utilization while the other CPUs are at 0% utilization. There is currently no solution for optimizing this situation, although Microsoft is working on the problem.

Memory: pages/second. In this comparison, a number of five pages per second or less is acceptable. If the number is much higher, there is a need to tune the memory to make better use of hardware rather than virtual memory. Remember that this is only a guideline, and the recommended setting may be too high for some systems.

Physical disks: percent time. This is the best place to tune database performance within NT environments. By analyzing the disk I/O, the load on the database can be leveled across multiple disks. High I/O settings indicate possible contention for I/O; files should be moved to less utilized disk devices to optimize overall performance.

Physical disks: queue length. This setting is used to determine the number of users sitting idle waiting for access to the same disk device. If this number is greater than two, moving files to less frequently used disk devices should level the load of the disk device.

Server: bytes total/second. This is a very nebulous performance indicator. It monitors the server network connection. It is nebulous because it bundles multiple network connections together. Some connections may be fast while others are slow, making it difficult to identify real problems, and very possibly resulting in a false sense of security. Intimate knowledge of the network card, connections, and hubstacks is critical for optimal server performance when moving data across the network. Careful analysis of the network card (or cards) and their settings, combined with the use of a Network Analyzer, can eliminate bottlenecks and improve throughput of network traffic at a magnitude of 10 to 1000 times depending on the hardware.

Resolving Typical NT Problems

The following paragraphs describe some common performance problems in an NT environment and suggest tuning solutions.

Load reasonableness. Assume that some software will not be well coded, and some background processes, such as a mail server or web server running on the same machine, can potentially starve the CPUs on the machine. Off-loading CPU hogs may be the only recourse.

Device Drivers. The device drivers for some types of hardware are notorious for wasting CPU clock cycles. Be sure to get the latest drivers from the hardware vendor to minimize this problem.

Memory and services. Although adding memory to NT is always a good solution, it is also expensive and usually must be planned to support the BANK system for EISA and PCI architectures. Before adding memory, check the Services in Control Panel because many background applications do not uninstall the old service when installing a new update or version. Thus, both the unused old service and the new service may be using valuable CPU and memory resources.

I/O Optimization. This is, by far, the best tuning option for database applications in the NT environment. If necessary, level the load across the disk devices by moving files. In situations where there are multiple controllers, be sure to level the load across the controllers too.

Using electrostatic devices and fast-wide SCSI can also help to increase performance, and fragmentation can be eliminated by using a Windows NT/2000 disk defragmentation product. Using this type of product is a good idea whether the disk is formatted for FAT or NTFS.

Finally, on NT servers, be sure to implement disk striping to split single data files across multiple disk drives and take advantage of RAID (Redundant Arrays of Inexpensive Disks) technology. Also increase the priority of the disk devices on the NT server. NT, by default, sets the disk device priority low. Change the disk priority setting in the Registry at service\lanman\server\parameters and add a key for ThreadPriority of type DWORD with a value of 2.

Monitoring System Performance In Windows 2000

In Windows 2000 the Informatica server uses system resources to process transformation, session execution, and reading and writing of data. The Informatica server also uses system memory for other data such as aggregate, joiner, rank, and cached lookup tables. With Windows 2000, you can use System Monitor in the Performance Console of the administrative tools, or system tools in the task manager, to monitor the amount of system resources used by the Informatica server and to identify system bottlenecks.

Windows 2000 provides the following tools (accessible under the Control Panel/Administration Tools/Performance) for monitoring resource usage on your computer:

• System Monitor
• Performance Logs and Alerts

These Windows 2000 monitoring tools enable you to analyze usage and detect bottlenecks at the disk, memory, processor, and network level.

The System Monitor displays a graph which is flexible and configurable. You can copy counter paths and settings from the System Monitor display to the Clipboard and paste counter paths from Web pages or other sources into the System Monitor display. The System Monitor is portable. This is useful in monitoring other systems that require administration. Typing perfmon.exe at the command prompt causes the system to start System Monitor, not Performance Monitor.

The Performance Logs and Alerts tool provides two types of performance-related logs (counter logs and trace logs) and an alerting function. Counter logs record sampled data about hardware resources and system services based on performance objects and counters in the same manner as System Monitor. Therefore they can be viewed in System Monitor. Data in counter logs can be saved as comma-separated or tab-separated files that are easily viewed with Excel. Trace logs collect event traces that measure performance statistics associated with events such as disk and file I/O, page faults, or thread activity. The alerting function allows you to define a counter value that will trigger actions such as sending a network message, running a program, or starting a log. Alerts are useful if you are not actively monitoring a particular counter threshold value, but want to be notified when it exceeds or falls below a specified value so that you can investigate and determine the cause of the change. You might want to set alerts based on established performance baseline values for your system.

Note: You must have Full Control access to a subkey in the registry in order to create or modify a log configuration. (The subkey is HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\SysmonLog\Log_Queries.)

The predefined log settings under Counter Logs, named System Overview, are configured to create a binary log that, after manual start-up, updates every 15 seconds and logs continuously until it achieves a maximum size. If you start logging with these settings, data is saved to the Perflogs folder on the root directory and includes the counters: Memory\Pages/sec, PhysicalDisk(_Total)\Avg. Disk Queue Length, and Processor(_Total)\% Processor Time.

If you want to create your own log setting, right-click one of the log types.

Some other useful counters include Physical Disk: Reads/sec and Writes/sec and Memory: Available Bytes and Cache Bytes.
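The Performance Monitor guidelines earlier in this section can be summarized as a simple threshold check. The sketch below is illustrative only: the function name and argument names are hypothetical, the readings are supplied by hand rather than read from Windows, and the thresholds are the guideline values quoted in the text (five pages per second, a disk queue length of two, and 100 percent on all CPUs).

```python
from typing import List, Sequence


def review_nt_counters(
    pages_per_sec: float,
    disk_queue_length: float,
    cpu_percent_per_processor: Sequence[float],
) -> List[str]:
    """Compare counter readings with the guideline values from the text."""
    findings: List[str] = []
    # Memory: five pages per second or less is described as acceptable.
    if pages_per_sec > 5:
        findings.append("memory: paging above guideline, tune memory usage")
    # Physical disks: a queue length greater than two suggests moving files.
    if disk_queue_length > 2:
        findings.append("disk: queue length above two, level the disk load")
    # Processor: "maxed out" means 100 percent on all CPUs.
    if cpu_percent_per_processor and all(p >= 100 for p in cpu_percent_per_processor):
        findings.append("cpu: all processors at 100 percent, consider more processing power")
    return findings


# Hypothetical readings taken from a counter log.
for finding in review_nt_counters(12, 3, [100, 100]):
    print(finding)
```

As the text notes, these values are only guidelines and may be too high for some systems, so they are better treated as starting points for alert thresholds than as fixed limits.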

Tuning Mappings for Better Performance

Challenge

In general, a PowerCenter mapping is the biggest 'bottleneck' in the load process as business rules determine the number and complexity of transformations in a mapping. This Best Practice offers some guidelines for tuning mappings.

Description

Analyze mappings for tuning only after you have tuned the system, source and target for peak performance.

Consider Single-Pass Reading

If several mappings use the same data source, consider a single-pass reading. Consolidate separate mappings into one mapping with either a single Source Qualifier Transformation or one set of Source Qualifier Transformations as the data source for the separate data flows.

Similarly, if a function is used in several mappings, a single-pass reading will reduce the number of times that function will be called in the session.

Optimize SQL Overrides

When SQL overrides are required in a Source Qualifier, Lookup Transformation, or in the update override of a target object, be sure the SQL statement is tuned. The extent to which and how SQL can be tuned depends on the underlying source or target database system.

Scrutinize Datatype Conversions

PowerCenter Server automatically makes conversions between compatible datatypes. When these conversions are performed unnecessarily, performance slows. For example, if a mapping moves data from an Integer port to a Decimal port, then back to an Integer port, the conversion may be unnecessary.

In some instances however, datatype conversions can help improve performance. This is especially true when integer values are used in place of other datatypes for performing comparisons using Lookup and Filter transformations.

Eliminate Transformation Errors

Large numbers of evaluation errors significantly slow performance of the PowerCenter Server. During transformation errors, the PowerCenter Server engine pauses to determine the cause of the error, removes the row causing the error from the data flow, and logs the error in the session log.

Transformation errors can be caused by many things including: conversion errors, conflicting mapping logic, any condition that is specifically set up as an error, and so on. The session log can help point out the cause of these errors. If errors recur consistently for certain transformations, re-evaluate the constraints for these transformations. Any source of errors should be traced and eliminated.

Optimize Lookup Transformations

There are a number of ways to optimize lookup transformations that are set up in a mapping.

When to Cache Lookups

When caching is enabled, the PowerCenter Server caches the lookup table and queries the lookup cache during the session. When this option is not enabled, the PowerCenter Server queries the lookup table on a row-by-row basis.

NOTE: All the tuning options mentioned in this Best Practice assume that memory and cache sizing for lookups are sufficient to ensure that caches will not page to disk. Practices regarding memory and cache sizing for Lookup transformations are covered in Best Practice: Tuning Sessions for Better Performance.

In general, if the lookup table needs less than 300MB of memory, lookup caching should be enabled. A better rule of thumb than memory size is to determine the 'size' of the potential lookup cache with regard to the number of rows expected to be processed. For example, in Mapping X the source and lookup contain the following number of records:

ITEMS (source):   5000 records
MANUFACTURER:     200 records
DIM_ITEMS:        100000 records

Number of Disk Reads

                            Cached Lookup    Un-cached Lookup
LKP_Manufacturer
  Build Cache                         200                   0
  Read Source Records                5000                5000
  Execute Lookup                        0                5000
  Total # of Disk Reads              5200               10000
LKP_DIM_ITEMS
  Build Cache                      100000                   0
  Read Source Records                5000                5000
  Execute Lookup                        0                5000
  Total # of Disk Reads            105000               10000

Consider the case where MANUFACTURER is the lookup table. If the lookup table is cached, it will take a total of 5,200 disk reads to build the cache and execute the lookup. If the lookup table is not cached, then it will take a total of 10,000 disk reads to execute the lookup. In this case, the number of records in the lookup table is small in comparison with the number of times the lookup is executed. So this lookup should be cached. This is the more likely scenario.

Consider the case where DIM_ITEMS is the lookup table. If the lookup table is cached, it will result in 105,000 total disk reads to build and execute the lookup. If the lookup table is not cached, then the disk reads would total 10,000. In this case the number of records in the lookup table is not small in comparison with the number of times the lookup will be executed. Thus the lookup should not be cached.

Use the following eight step method to determine if a lookup should be cached:
1. Code the lookup into the mapping.
2. Select a standard set of data from the source. For example, add a where clause on a relational source to load a sample 10,000 rows.
3. Run the mapping with caching turned off and save the log.
4. Run the mapping with caching turned on and save the log to a different name than the log created in step 3.
5. Look in the cached lookup log and determine how long it takes to cache the lookup object. Note this time in seconds: LOOKUP TIME IN SECONDS = LS.
6. In the non-cached log, take the time from the last lookup cache to the end of the load in seconds and divide the number of rows being processed by that time: NON-CACHED ROWS PER SECOND = NRS.
7. In the cached log, take the time from the last lookup cache to the end of the load in seconds and divide the number of rows being processed by that time: CACHED ROWS PER SECOND = CRS.
8. Use the following formula to find the breakeven row point: (LS*NRS*CRS)/(CRS-NRS) = X, where X is the breakeven point. If your expected number of source records is less than X, it is better to not cache the lookup. If your expected number of source records is more than X, it is better to cache the lookup.
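The breakeven formula in step 8 follows from equating the two run times: an un-cached load of N rows takes about N/NRS seconds, a cached load takes about LS + N/CRS seconds, and the two are equal when N = (LS*NRS*CRS)/(CRS-NRS). A minimal Python sketch of that calculation is shown here; the function names are hypothetical and are not part of PowerCenter, and the timings are the ones used in the worked example that follows.

```python
def breakeven_rows(ls: float, nrs: float, crs: float) -> float:
    """Return X = (LS * NRS * CRS) / (CRS - NRS).

    ls  -- seconds needed to build the lookup cache (LS)
    nrs -- rows per second with caching turned off (NRS)
    crs -- rows per second with caching turned on (CRS)
    """
    if crs <= nrs:
        # If the cached run is not faster per row, caching never pays back
        # the time spent building the cache, so no breakeven point exists.
        raise ValueError("CRS must be greater than NRS for a breakeven point")
    return (ls * nrs * crs) / (crs - nrs)


def should_cache(expected_rows: int, ls: float, nrs: float, crs: float) -> bool:
    """Cache the lookup only when the expected row count exceeds X."""
    return expected_rows > breakeven_rows(ls, nrs, crs)


if __name__ == "__main__":
    x = breakeven_rows(ls=166, nrs=147, crs=232)
    print(f"breakeven point: {int(x):,} rows")
    print("cache at 100,000 rows:", should_cache(100_000, 166, 147, 232))
    print("cache at 5,000 rows:", should_cache(5_000, 166, 147, 232))
```

The guard on CRS and NRS is a design choice of this sketch: when the cached run is not faster per row, the formula yields a negative or undefined value, so the function refuses to return one.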

For example: Assume the lookup takes 166 seconds to cache (LS=166). Assume with a cached lookup the load is 232 rows per second (CRS=232). Assume with a non-cached lookup the load is 147 rows per second (NRS=147). The formula would result in: (166*147*232)/(232-147) = 66,603. Thus, if the source has less than 66,603 records, the lookup should not be cached. If it has more than 66,603 records, then the lookup should be cached.

Sharing Lookup Caches

There are a number of methods for sharing lookup caches.
• Within a specific session run for a mapping, if the same lookup is used multiple times in a mapping, the PowerCenter Server will re-use the cache for the multiple instances of the lookup. Using the same lookup multiple times in the mapping will be more resource intensive with each successive instance. If multiple cached lookups are from the same table but are expected to return different columns of data, it may be better to set up the multiple lookups to bring back the same columns even though not all return ports are used in all lookups. Bringing back a common set of columns may reduce the number of disk reads.
• Across sessions of the same mapping, the use of an unnamed persistent cache allows multiple runs to use an existing cache file stored on the PowerCenter Server. If the option of creating a persistent cache is set in the lookup properties, the memory cache created for the lookup during the initial run is saved to the PowerCenter Server. This can improve performance because the Server builds the memory cache from cache files instead of the database. This feature should only be used when the lookup table is not expected to change between session runs.
• Across different mappings and sessions, the use of a named persistent cache allows sharing of an existing cache file.

Reducing the Number of Cached Rows

There is an option to use a SQL override in the creation of a lookup cache. Options can be added to the WHERE clause to reduce the set of records included in the resulting cache.

NOTE: If you use a SQL override in a lookup, the lookup must be cached.

Optimizing the Lookup Condition

In the case where a lookup uses more than one lookup condition, set the conditions with an equal sign first in order to optimize lookup performance.

Indexing the Lookup Table

The PowerCenter Server must query, sort and compare values in the lookup condition columns. As a result, indexes on the database table should include every column used in a lookup condition. This can improve performance for both cached and un-cached lookups.
• In the case of a cached lookup, an ORDER BY condition is issued in the SQL statement used to create the cache. Columns used in the ORDER BY condition should be indexed. The session log will contain the ORDER BY statement.
• In the case of an un-cached lookup, since a SQL statement is created for each row passing into the lookup transformation, performance can be helped by indexing columns in the lookup condition.

Optimize Filter and Router Transformations

Filtering data as early as possible in the data flow improves the efficiency of a mapping. Instead of using a Filter Transformation to remove a sizeable number of rows in the middle or end of a mapping, use a filter on the Source Qualifier or a Filter Transformation immediately after the source qualifier to improve performance.

Avoid complex expressions when creating the filter condition. Filter transformations are most effective when a simple integer or TRUE/FALSE expression is used in the filter condition.

Filters or routers should also be used to drop rejected rows from an Update Strategy transformation if rejected rows do not need to be saved.

Replace multiple filter transformations with a router transformation. This reduces the number of transformations in the mapping and makes the mapping easier to follow.

Optimize Aggregator Transformations

Aggregator Transformations often slow performance because they must group data before processing it.

Use simple columns in the group by condition to make the Aggregator Transformation more efficient. When possible, use numbers instead of strings or dates in the GROUP BY columns. Also avoid complex expressions in the Aggregator expressions, especially in GROUP BY ports.

Use the Sorted Input option in the aggregator. This option requires that data sent to the aggregator be sorted in the order in which the ports are used in the aggregator's group by. The Sorted Input option decreases the use of aggregate caches. When it is used, the PowerCenter Server assumes all data is sorted by group and, as a group is passed through an aggregator, calculations can be performed and information passed on to the next transformation. Without sorted input, the Server must wait for all rows of data before processing aggregate calculations. Use of the Sorted Input option is usually accompanied by a Source Qualifier which uses the Number of Sorted Ports option.

Use an Expression and Update Strategy instead of an Aggregator Transformation. This technique can only be used if the source data can be sorted. Further, using this option assumes that a mapping is using an Aggregator with the Sorted Input option. In the Expression Transformation, the use of variable ports is required to hold data from the previous row of data processed. The premise is to use the previous row of data to determine whether the current row is a part of the current group or is the beginning of a new group. Thus, if the row is a part of the current group, then its data would be used to continue calculating the current group function. An Update Strategy Transformation would follow the Expression Transformation and set the first row of a new group to insert and the following rows to update.

Optimize Joiner Transformations

Joiner transformations can slow performance because they need additional space in memory at run time to hold intermediate results.

Define the rows from the smaller set of data in the joiner as the Master rows. The Master rows are cached to memory and the detail records are then compared to rows in the cache of the Master rows. In order to minimize memory requirements, the smaller set of data should be cached and thus set as Master.

Use Normal joins whenever possible. Normal joins are faster than outer joins and the resulting set of data is also smaller.

Use the database to do the join when sourcing data from the same database schema. Database systems usually can perform the join more quickly than the Informatica Server, so a SQL override or a join condition should be used when joining multiple tables from the same database schema.

Optimize Sequence Generator Transformations

Sequence Generator transformations need to determine the next available sequence number, thus increasing the Number of Cached Values property can increase performance. This property determines the number of values the Informatica Server caches at one time. If it is set to cache no values, then the Informatica Server must query the Informatica repository each time to determine the next number that can be used. Configuring the Number of Cached Values to a value greater than 1000 should be considered. It should be noted that any cached values not used in the course of a session are 'lost', since the sequence generator value in the repository is set, when it is called next time, to give the next set of cache values.

Avoid External Procedure Transformations

For the most part, making calls to external procedures slows down a session. If possible, avoid the use of these Transformations, which include Stored Procedures, External Procedures and Advanced External Procedures.
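The Expression and Update Strategy technique described above for replacing an Aggregator can be illustrated outside PowerCenter. The following Python sketch is only an analogy under stated assumptions: the rows are already sorted by group key, `prev_key` stands in for the variable port that holds the previous row's key, and the "insert" and "update" labels stand in for the Update Strategy decision. The function name and the sample data are hypothetical.

```python
from typing import Iterable, Iterator, Tuple

_NO_PREVIOUS_ROW = object()  # sentinel: no row has been processed yet


def running_group_totals(
    sorted_rows: Iterable[Tuple[str, float]],
) -> Iterator[Tuple[str, float, str]]:
    """Yield (key, running_total, action) for rows pre-sorted by key.

    The first row of each group is flagged "insert"; later rows of the
    same group carry the accumulated total and are flagged "update".
    """
    prev_key = _NO_PREVIOUS_ROW   # value remembered from the previous row
    running_total = 0.0
    for key, amount in sorted_rows:
        if key == prev_key:
            # Same group as the previous row: keep accumulating.
            running_total += amount
            action = "update"
        else:
            # Key changed: this row starts a new group.
            running_total = amount
            action = "insert"
        prev_key = key
        yield key, running_total, action


if __name__ == "__main__":
    rows = [("A", 10.0), ("A", 5.0), ("B", 7.0)]
    for result in running_group_totals(rows):
        print(result)
```

Because each row is compared only with its predecessor, the sketch needs no cache of the whole data set, which is the point of the technique. It also shows why unsorted input breaks it: a key that reappears later would be treated as a new group.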

Field Level Transformation Optimization

As a final step in the tuning process, expressions used in transformations can be tuned. When examining expressions, focus on complex expressions for possible simplification.

To help isolate slow expressions, do the following:
1. Time the session with the original expression.
2. Copy the mapping and replace half the complex expressions with a constant.
3. Run and time the edited session.
4. Make another copy of the mapping and replace the other half of the complex expressions with a constant.
5. Run and time the edited session.

Processing field level transformations takes time. If the transformation expressions are complex, then processing will be slower. It is often possible to get a 10-20% performance improvement by optimizing complex field level transformations. Use the target table mapping reports or the Metadata Reporter to examine the transformations. Likely candidates for optimization are the fields with the most complex expressions. Keep in mind that there may be more than one field causing performance problems.

Factoring out Common Logic

This can reduce the number of times a mapping performs the same logic. If a mapping performs the same logic multiple times, moving the task upstream in the mapping may allow the logic to be done just once. For example, a mapping has five target tables. Each target requires a Social Security Number lookup. Instead of performing the lookup right before each target, move the lookup to a position before the data flow splits.

Minimize Function Calls

Anytime a function is called it takes resources to process. There are several common examples where function calls can be reduced or eliminated.

Aggregate function calls can sometimes be reduced. In the case of each aggregate function call, the Informatica Server must search and group the data. Thus the following expression:

SUM(Column A) + SUM(Column B)

Can be optimized to:

SUM(Column A + Column B)

In general, operators are faster than functions, so operators should be used whenever possible. For example, if you have an expression which involves a CONCAT function such as:

CONCAT(CONCAT(FIRST_NAME, ' '), LAST_NAME)

It can be optimized to:

FIRST_NAME || ' ' || LAST_NAME

Remember that IIF() is a function that returns a value, not just a logical test. This allows many logical statements to be written in a more compact fashion. For example:

IIF(FLG_A='Y' and FLG_B='Y' and FLG_C='Y', VAL_A+VAL_B+VAL_C,
IIF(FLG_A='Y' and FLG_B='Y' and FLG_C='N', VAL_A+VAL_B,
IIF(FLG_A='Y' and FLG_B='N' and FLG_C='Y', VAL_A+VAL_C,
IIF(FLG_A='Y' and FLG_B='N' and FLG_C='N', VAL_A,
IIF(FLG_A='N' and FLG_B='Y' and FLG_C='Y', VAL_B+VAL_C,
IIF(FLG_A='N' and FLG_B='Y' and FLG_C='N', VAL_B,
IIF(FLG_A='N' and FLG_B='N' and FLG_C='Y', VAL_C,
IIF(FLG_A='N' and FLG_B='N' and FLG_C='N', 0.0))))))))

Can be optimized to:

IIF(FLG_A='Y', VAL_A, 0.0) + IIF(FLG_B='Y', VAL_B, 0.0) + IIF(FLG_C='Y', VAL_C, 0.0)

The original expression had 8 IIFs, 16 ANDs and 24 comparisons. The optimized expression results in 3 IIFs, 3 comparisons and two additions.

Be creative in making expressions more efficient. The following is an example of a rework of an expression which reduces three comparisons to one:

IIF(X=1 OR X=5 OR X=9, 'yes', 'no')

Can be optimized to:

IIF(MOD(X, 4) = 1, 'yes', 'no')

Note that the two expressions agree only if X cannot take other values that leave a remainder of 1 when divided by 4, such as 13 or 17.

Calculate Once, Use Many Times

Avoid calculating or testing the same value multiple times. If the same subexpression is used several times in a transformation, consider making the subexpression a local variable. The local variable can be used only within the transformation, but calculating the variable only once can speed performance.

Choose Numeric versus String Operations

The Informatica Server processes numeric operations faster than string operations. For example, if a lookup is done on a large amount of data on two columns, EMPLOYEE_NAME and EMPLOYEE_ID, configuring the lookup around EMPLOYEE_ID improves performance.

Optimizing Char-Char and Char-Varchar Comparisons

When the Informatica Server performs comparisons between CHAR and VARCHAR columns, it slows each time it finds trailing blank spaces in the row. The Treat CHAR as CHAR On Read option can be set in the Informatica Server setup so that the Informatica Server does not trim trailing spaces from the end of CHAR source fields.

Use DECODE instead of LOOKUP

When a LOOKUP function is used, the Informatica Server must look up a table in the database. When a DECODE function is used, the lookup values are incorporated into the expression itself so the Informatica Server does not need to look up a separate table. Thus, when looking up a small set of unchanging values, using DECODE may improve performance.

Reduce the Number of Transformations in a Mapping

Whenever possible the number of transformations should be reduced, as there is always overhead involved in moving data between transformations. Along the same lines, unnecessary links between transformations should be removed to minimize the amount of data moved. This is especially important with data being pulled from the Source Qualifier Transformation.
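Expression rewrites such as the IIF examples under Minimize Function Calls are easy to get subtly wrong, so it can help to check them exhaustively before changing a mapping. The Python sketch below is one way to do that outside PowerCenter. The `iif` helper is a hypothetical stand-in for the expression-language function, and treating an omitted else branch as 0.0 is an assumption of this sketch. The flags are assumed to take only the values 'Y' and 'N'.

```python
from itertools import product


def iif(condition, when_true, when_false=0.0):
    """Stand-in for IIF(); the 0.0 default for a missing else is assumed."""
    return when_true if condition else when_false


def nested(a, b, c, va, vb, vc):
    """The original eight-branch expression, transcribed branch by branch."""
    return iif(a == "Y" and b == "Y" and c == "Y", va + vb + vc,
           iif(a == "Y" and b == "Y" and c == "N", va + vb,
           iif(a == "Y" and b == "N" and c == "Y", va + vc,
           iif(a == "Y" and b == "N" and c == "N", va,
           iif(a == "N" and b == "Y" and c == "Y", vb + vc,
           iif(a == "N" and b == "Y" and c == "N", vb,
           iif(a == "N" and b == "N" and c == "Y", vc,
           iif(a == "N" and b == "N" and c == "N", 0.0))))))))


def flattened(a, b, c, va, vb, vc):
    """The optimized form: one IIF per flag, summed."""
    return iif(a == "Y", va) + iif(b == "Y", vb) + iif(c == "Y", vc)


# Distinct powers of two make every combination of values sum uniquely,
# so a wrong branch cannot be hidden by a coincidental total.
rewrite_matches = all(
    nested(a, b, c, 1.0, 2.0, 4.0) == flattened(a, b, c, 1.0, 2.0, 4.0)
    for a, b, c in product("YN", repeat=3)
)

# Values of X in 0..19 where the OR form and the MOD form disagree.
mod_mismatches = [x for x in range(20) if (x in (1, 5, 9)) != (x % 4 == 1)]

if __name__ == "__main__":
    print("IIF rewrite matches on all flag combinations:", rewrite_matches)
    print("X values where the MOD rewrite differs:", mod_mismatches)
```

The second check makes the domain restriction on the MOD rewrite concrete: the shorter expression is safe only when the values it reports as mismatches cannot occur in the data.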

Tuning Sessions for Better Performance

Challenge

Running sessions is where 'the pedal hits the metal'. A common misconception is that this is the area where most tuning should occur. While it is true that various specific session options can be modified to improve performance, this should not be the major or only area of focus when implementing performance tuning.

Description

When you have finished optimizing the sources, target database and mappings, you should review the sessions for performance optimization.

Caches

The greatest area for improvement at the session level usually involves tweaking memory cache settings. The Aggregator, Joiner, Rank and Lookup Transformations use caches. Review the memory cache settings for sessions where the mappings contain any of these transformations.

When performance details are collected for a session, information about readfromdisk and writetodisk counters for Aggregator, Joiner, Rank and/or Lookup transformations can point to a session bottleneck. Any value other than zero for these counters may indicate a bottleneck.

Because index and data caches are created for each of these transformations, both the index cache and data cache sizes may affect performance, depending on the factors discussed in the following paragraphs.

When the PowerCenter Server creates memory caches, it may also create cache files. Both index and data cache files can be created for the following transformations in a mapping:
• Aggregator transformation (without sorted ports)
• Joiner transformation
• Rank transformation
• Lookup transformation (with caching enabled)

The PowerCenter Server creates the index and data cache files by default in the PowerCenter Server variable directory, $PMCacheDir. The naming convention used by the PowerCenter Server for these files is PM [type of widget] [generated number].dat or .idx. For example, an aggregate data cache file would be named PMAGG31_19.dat.

The cache directory may be changed, however, if disk space is a constraint. Informatica recommends that the cache directory be local to the PowerCenter Server. You may encounter performance or reliability problems when you cache large quantities of data on a mapped or mounted drive.

If the PowerCenter Server requires more memory than the configured cache size, it stores the overflow values in these cache files. Since paging to disk can slow session performance, try to configure the index and data cache sizes to store the appropriate amount of data in memory. Refer to Chapter 9: Session Caches in the Informatica Session and Server Guide for detailed information on determining cache sizes.

The PowerCenter Server writes to the index and data cache files during a session in the following cases:
• The mapping contains one or more Aggregator transformations, and the session is configured for incremental aggregation.
• The mapping contains a Lookup transformation that is configured to use a persistent lookup cache, and the Informatica Server runs the session for the first time.
• The mapping contains a Lookup transformation that is configured to initialize the persistent lookup cache.
• The DTM runs out of cache memory and pages to the local cache files. The DTM may create multiple files when processing large amounts of data. The session fails if the local directory runs out of disk space.

When a session is run, the PowerCenter Server writes a message in the session log indicating the cache file name and the transformation name. When a session completes, the DTM generally deletes the overflow index and data cache files. However, index and data files may exist in the cache directory if the session is configured for either incremental aggregation or to use a persistent lookup cache. Cache files may also remain if the session does not complete successfully.

If a cache file handles more than 2 gigabytes of data, the PowerCenter Server creates multiple index and data files. When creating these files, the PowerCenter Server appends a number to the end of the filename, such as PMAGG*.idx1 and PMAGG*.idx2. The number of index and data files is limited only by the amount of disk space available in the cache directory.

Aggregator Caches

Keep the following items in mind when configuring the aggregate memory cache sizes.
• Allocate enough space to hold at least one row in each aggregate group.
• Remember that you only need to configure cache memory for an Aggregator transformation that does NOT use sorted ports. The PowerCenter Server uses memory to process an Aggregator transformation with sorted ports, not cache memory.
• Incremental aggregation can improve session performance. When it is used, the PowerCenter Server saves index and data cache information to disk at the end of the session. The next time the session runs, the PowerCenter Server uses this historical information to perform the incremental aggregation. The PowerCenter Server names these files PMAGG*.dat and PMAGG*.idx and saves them to the cache directory. Mappings that have sessions which use incremental aggregation should be set up so that only new detail records are read with each subsequent run.

Joiner Caches

The source with fewer records should be specified as the master source because only the master source records are read into cache. When a session is run with a Joiner transformation, the PowerCenter Server reads all the rows from the master source and builds memory caches based on the master rows. After the memory caches are built, the PowerCenter Server reads the rows from the detail source and performs the joins.

Also, the PowerCenter Server automatically aligns all data for joiner caches on an eight-byte boundary, which helps increase the performance of the join.

Lookup Caches

Several options can be explored when dealing with lookup transformation caches.
• Persistent caches should be used when lookup data is not expected to change often. Lookup cache files are saved after a session which has a lookup that uses a persistent cache is run for the first time. These files are reused for subsequent runs, bypassing the querying of the database for the lookup. If the lookup table changes, you must be sure to set the Recache from Database option to ensure that the lookup cache files will be rebuilt.
• Lookup caching should be enabled for relatively small tables. Refer to Best Practice: Tuning Mappings for Better Performance to determine when lookups should be cached. When the Lookup transformation is not configured for caching, the PowerCenter Server queries the lookup table for each input row. The result of the Lookup query and processing is the same, regardless of whether the lookup table is cached or not. However, when the transformation is configured to not cache, the PowerCenter Server queries the lookup table instead of the lookup cache. Using a lookup cache can sometimes increase session performance.
• Just like for a joiner, the PowerCenter Server aligns all data for lookup caches on an eight-byte boundary, which helps increase the performance of the lookup.

Allocating Buffer Memory

When the PowerCenter Server initializes a session, it allocates blocks of memory to hold source and target data. Sessions that use a large number of sources and targets may require additional memory blocks.

You can tweak session properties to increase the number of available memory blocks by adjusting:
• DTM Buffer Pool Size: the default setting is 12,000,000 bytes
• Default Buffer Block Size: the default size is 64,000 bytes

To configure these settings, first determine the number of memory blocks the PowerCenter Server requires to initialize the session. Then you can calculate the buffer pool size and/or the buffer block size based on the default settings, to create the required number of session blocks.

If there are XML sources and targets in the mappings, use the number of groups in the XML source or target in the total calculation for the total number of sources and targets.

• Increasing the DTM Buffer Pool Size

The DTM Buffer Pool Size setting specifies the amount of memory the PowerCenter Server uses as DTM buffer memory. The PowerCenter Server uses DTM buffer memory to create the internal data structures and buffer blocks used to bring data into and out of the Server. When the DTM buffer memory is increased, the PowerCenter Server creates more buffer blocks, which can improve performance during momentary slowdowns.

If a session's performance details show low numbers for your source and target BufferInput_efficiency and BufferOutput_efficiency counters, increasing the DTM buffer pool size may improve performance.

When the DTM buffer memory allocation is increased, the total memory available on the PowerCenter Server needs to be evaluated. If a session is part of a concurrent batch, the combined DTM buffer memory allocated for the sessions or batches must not exceed the total memory for the PowerCenter Server system.

Increasing DTM buffer memory allocation generally causes performance to improve initially and then level off. If you don't see a significant performance increase after increasing DTM buffer memory, then it was not a factor in session performance.

• Optimizing the Buffer Block Size

Within a session, you may modify the buffer block size by changing it in the Advanced Parameters section. This specifies the size of a memory block that is used to move data throughout the pipeline. Each source, each transformation, and each target may have a different row size, which results in different numbers of rows that can be fit into one memory block. Row size is determined in the server, based on number of ports, their datatypes and precisions.

Ideally, block size should be configured so that it can hold roughly 100 rows, plus or minus a factor of ten. When calculating this, use the source or target with the largest row size. The default is 64K. The buffer block size does not become a factor in session performance until the number of rows falls below 10 or goes above 1000.

Informatica recommends that the size of the shared memory (which determines the number of buffers available to the session) should not be increased at all unless the mapping is "complex" (i.e., more than 20 transformations).

Running Concurrent Batches

Performance can sometimes be improved by creating a concurrent batch to run several sessions in parallel on one PowerCenter Server. This technique should only be employed on servers with multiple CPUs available. Each concurrent session will use a maximum of 1.4 CPUs for the first session, and a maximum of 1 CPU for each additional session. Also, it has been noted that simple mappings (i.e., mappings with only a few transformations) do not make the engine "CPU bound", and therefore use a lot less processing power than a full CPU.

If there are independent sessions that use separate sources and mappings to populate different targets, they can be placed in a concurrent batch and run at the same time.

If there is a complex mapping with multiple sources, you can separate it into several simpler mappings with separate sources. This enables you to place the sessions for each of the mappings in a concurrent batch to be run in parallel.

Partitioning Sessions

If large amounts of data are being processed with PowerCenter 5.x, data can be processed in parallel with a single session by partitioning the source via the source qualifier. Partitioning allows you to break a single source into multiple sources and to run each in parallel. The PowerCenter Server will spawn a Read and Write thread for each partition, thus allowing for simultaneous reading, processing, and writing. Keep in mind that each partition will compete for the same resources (i.e., memory, disk, and CPU), so make sure that the hardware and memory are sufficient to support a parallel session. Also, the DTM buffer pool size is split among all partitions, so it may need to be increased for optimal performance.

Increasing the Target Commit Interval

One method of resolving target database bottlenecks is to increase the commit interval. Each time the PowerCenter Server commits, performance slows. Therefore, the smaller the commit interval, the more often the PowerCenter Server writes to the target database, and the slower the overall performance. If you increase the commit interval, the number of times the PowerCenter Server commits decreases and performance may improve.

When increasing the commit interval at the session level, you must remember to increase the size of the database rollback segments to accommodate this larger number of rows. One of the major reasons that Informatica has set the default commit interval to 10,000 is to accommodate the default rollback segment / extent size of most databases. If you increase both the commit interval and the database rollback segments, you should see an increase in performance. In some cases though, just increasing the commit interval without making the appropriate database changes may cause the session to fail part way through (you may get a database error like "unable to extend rollback segments" in Oracle).

Disabling Session Recovery

You can improve performance by turning off session recovery. The PowerCenter Server writes recovery information in the OPB_SRVR_RECOVERY table during each commit. This can decrease performance. The PowerCenter Server setup can be set to disable session recovery. But be sure to weigh the importance of improved session performance against the ability to recover an incomplete session when considering this option.

Disabling Decimal Arithmetic

If a session runs with decimal arithmetic enabled, disabling decimal arithmetic may improve session performance.

The Decimal datatype is a numeric datatype with a maximum precision of 28. To use a high-precision Decimal datatype in a session, it must be configured so that the PowerCenter Server recognizes this datatype by selecting Enable Decimal Arithmetic in the session property sheet. However, since reading and manipulating a high-precision datatype (i.e., those with a precision of greater than 28) can slow the PowerCenter Server, session performance may be improved by disabling decimal arithmetic.

Reducing Error Tracing

If a session contains a large number of transformation errors, you may be able to improve performance by reducing the amount of data the PowerCenter Server writes to the session log.

To reduce the amount of time spent writing to the session log file, set the tracing level to Terse. Terse tracing should only be set if the sessions run without problems and session details are not required. At this tracing level, the PowerCenter Server does not write error messages or row-level information for reject data. However, if Terse is not an acceptable level of detail, you may want to consider leaving the tracing level at Normal and focus your efforts on reducing the number of transformation errors.

Note that the tracing level must be set to Normal in order to use the reject loading utility.

As an additional debug option (beyond the PowerCenter Debugger), you may set the tracing level to Verbose to see the flow of data between transformations. However, this will significantly affect the session performance. Do not use Verbose tracing except when testing sessions. Always remember to switch tracing back to Normal after the testing is complete.

The session tracing level overrides any transformation-specific tracing levels within the mapping. Informatica does not recommend reducing error tracing as a long-term response to high levels of transformation errors. Because there are only a handful of reasons why transformation errors occur, it makes sense to fix and prevent any recurring transformation errors.

Determining Bottlenecks

Challenge

Because there are many variables involved in identifying and rectifying performance bottlenecks, an efficient method for determining where bottlenecks exist is crucial to good data warehouse management.

Description

The first step in performance tuning is to identify performance bottlenecks. Carefully consider the following five areas to determine where bottlenecks exist; use a process of elimination, investigating each area in the order indicated:

1. Write
2. Read
3. Mapping
4. Session
5. System

Before you begin, you should establish an approach for identifying performance bottlenecks. To begin, attempt to isolate the problem by running test sessions. You should be able to compare the session’s original performance with that of the tuned session’s performance.

The swap method is very useful for determining the most common bottlenecks. It involves the following five steps:

1. Make a temporary copy of the mapping and/or session that is to be tuned, then tune the copy before making changes to the original.
2. Implement only one change at a time and test for any performance improvements to gauge which tuning methods work most effectively in the environment.
3. Document the change made to the mapping and/or session and the performance metrics achieved as a result of the change. The actual execution time may be used as a performance metric.
4. Delete the temporary sessions upon completion of performance tuning.
5. Make appropriate tuning changes to mappings and/or sessions.
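The bookkeeping behind steps 2 and 3 of the swap method can be sketched in a few lines of Python. This is only an illustration, assuming execution time is the chosen metric; the names `TuningRun`, `improvement_pct`, and `rank_changes` are hypothetical and are not part of PowerCenter.

```python
from dataclasses import dataclass


@dataclass
class TuningRun:
    """One test run of the temporary copy with a single change applied."""
    change: str             # free-text description of the one change made
    elapsed_seconds: float  # actual execution time, used as the metric


def improvement_pct(baseline_seconds: float, tuned_seconds: float) -> float:
    """Percentage reduction in run time relative to the original session."""
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    return (baseline_seconds - tuned_seconds) / baseline_seconds * 100.0


def rank_changes(baseline_seconds, runs):
    """Return (change, improvement %) pairs, largest improvement first."""
    scored = [(run.change, improvement_pct(baseline_seconds, run.elapsed_seconds))
              for run in runs]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Because each run records exactly one change, the ranked list shows directly which tuning method helped most in the environment.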

Write Bottlenecks

Relational Targets

The most common performance bottleneck occurs when the PowerCenter Server writes to a target database. This type of bottleneck can easily be identified with the following procedure:

1. Make a copy of the original session.
2. Configure the test session to write to a flat file.

If the session performance is significantly increased when writing to a flat file, you have a write bottleneck.

Flat File Targets

If the session targets a flat file, you probably do not have a write bottleneck. You can optimize session performance by writing to a flat file target local to the PowerCenter Server. If the local flat file is very large, you can optimize the write process by dividing it among several physical drives.

Read Bottlenecks

Relational Sources

If the session reads from a relational source, you should first use a read test session with a flat file as the source in the test session. You may also use a database query to indicate if a read bottleneck exists.

Using a Test Session with a Flat File Source

1. Create a mapping and session that writes the source table data to a flat file.
2. Create a test mapping that contains only the flat file source, the source qualifier, and the target table.
3. Create a session for the test mapping.

If the test session’s performance increases significantly, you have a read bottleneck.

Using a Database Query

To identify a source bottleneck by executing a read query directly against the source database, follow these steps:

1. Copy the read query directly from the session log.
2. Run the query against the source database with a query tool such as SQL Plus.
3. Measure the query execution time and the time it takes for the query to return the first row.
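The two measurements in step 3 of the database query procedure can be sketched with any Python DB-API connection. The example below is an assumption-laden illustration: it uses an in-memory SQLite database as a stand-in for the real source, and the `lineitem` table and the `time_read_query` helper are hypothetical.

```python
import sqlite3
import time


def time_read_query(conn, query):
    """Return (seconds to first row, seconds to fetch all rows, row count)."""
    start = time.perf_counter()
    cursor = conn.execute(query)
    first = cursor.fetchone()
    first_row_seconds = time.perf_counter() - start
    row_count = 0 if first is None else 1
    for _ in cursor:  # drain the remaining rows
        row_count += 1
    total_seconds = time.perf_counter() - start
    return first_row_seconds, total_seconds, row_count


def demo():
    # SQLite stands in for the source database; in practice, connect to the
    # real source and paste the read query copied from the session log.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE lineitem (id INTEGER, qty INTEGER)")
    conn.executemany("INSERT INTO lineitem VALUES (?, ?)",
                     [(i, i % 7) for i in range(1000)])
    return time_read_query(conn, "SELECT id, qty FROM lineitem")
```

Comparing the first-row time with the total time gives the two figures the procedure asks for.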

If there is a long delay between the two time measurements, you have a source bottleneck.

Flat File Sources

If your session reads from a flat file source, you probably do not have a read bottleneck. Tuning the Line Sequential Buffer Length to a size large enough to hold approximately four to eight rows of data at a time (for flat files) may help when reading flat file sources. Ensure the flat file source is local to the PowerCenter Server.

Mapping Bottlenecks

If you have eliminated the reading and writing of data as bottlenecks, you may have a mapping bottleneck. Use the swap method to determine if the bottleneck is in the mapping. After using the swap method, you can use the session’s performance details to determine if mapping bottlenecks exist. High Rowsinlookupcache and Errorrows counters indicate mapping bottlenecks.

Follow these steps to identify mapping bottlenecks:

Using a Test Mapping without transformations

1. Make a copy of the original mapping.
2. In the copied mapping, retain only the sources, source qualifiers, and any custom joins or queries.
3. Remove all transformations.
4. Connect the source qualifiers to the target.

High Rowsinlookupcache counters: Multiple lookups can slow the session. You may improve session performance by locating the largest lookup tables and tuning those lookup expressions.

High Errorrows counters: Transformation errors affect session performance. If a session has large numbers in any of the Transformation_errorrows counters, you may improve performance by eliminating the errors.

For further details on eliminating mapping bottlenecks, refer to the Best Practice: Tuning Mappings for Better Performance.

Session Bottlenecks

Session performance details can be used to flag other problem areas in the session Advanced Options Parameters or in the mapping.

Low Buffer Input and Buffer Output Counters

If the BufferInput_efficiency and BufferOutput_efficiency counters are low for all sources and targets, increasing the session DTM buffer pool size may improve performance.

Aggregator, Rank, and Joiner Readfromdisk and Writetodisk Counters

If a session contains Aggregator, Rank, or Joiner transformations, examine each Transformation_readfromdisk and Transformation_writetodisk counter. If these counters display any number other than zero, you can improve session performance by increasing the index and data cache sizes.

For further details on eliminating session bottlenecks, refer to the Best Practice: Tuning Sessions for Better Performance.

System Bottlenecks

After tuning the source, target, mapping, and session, you may also consider tuning the system hosting the PowerCenter Server.

Windows NT/2000

Use system tools such as the Performance tab in the Task Manager or the Performance Monitor to view CPU usage and total memory usage.

UNIX

On UNIX, use system tools like vmstat and iostat to monitor such items as system performance and disk swapping actions.

For further information regarding system tuning, refer to the Best Practices: Performance Tuning UNIX-Based Systems and Performance Tuning NT/2000-Based Systems.

The following table details the Performance Counters that can be used to flag session and mapping bottlenecks. Note that these can only be found in the Session Performance Details file.

Transformation / Counter / Description

Source Qualifier and Normalizer Transformations
    BufferInput_Efficiency: Percentage reflecting how seldom the reader waited for a free buffer when passing data to the DTM.
    BufferOutput_Efficiency: Percentage reflecting how seldom the DTM waited for a full buffer of data from the reader.

Target
    BufferInput_Efficiency: Percentage reflecting how seldom the DTM waited for a free buffer when passing data to the writer.
    BufferOutput_Efficiency: Percentage reflecting how seldom the Informatica Server waited for a full buffer of data from the reader.

Aggregator and Rank Transformations
    Aggregator/Rank_readfromdisk: Number of times the Informatica Server read from the index or data file on the local disk, instead of using cached data.
    Aggregator/Rank_writetodisk: Number of times the Informatica Server wrote to the index or data file on the local disk, instead of using cached data.

Joiner Transformation (see Note below)
    Joiner_readfromdisk: Number of times the Informatica Server read from the index or data file on the local disk, instead of using cached data.
    Joiner_writetodisk: Number of times the Informatica Server wrote to the index or data file on the local disk, instead of using cached data.

Lookup Transformation
    Lookup_rowsinlookupcache: Number of rows stored in the lookup cache.

All Transformations
    Transformation_errorrows: Number of rows in which the Informatica Server encountered an error.

Note: The PowerCenter Server generates two sets of performance counters for a Joiner transformation. The first set of counters refers to the master source. The Joiner transformation does not generate output row counters associated with the master source. The second set of counters refers to the detail source.
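The counter rules described in this Best Practice can be applied mechanically once the values have been copied out of the Session Performance Details. The sketch below is a hypothetical helper, not a PowerCenter API. The dictionary key format, the 50 percent cut-off for "low" efficiency, and the 1000-row limit for "large" error counts are all assumptions chosen for illustration.

```python
def flag_bottlenecks(counters, low_efficiency_pct=50.0, error_row_limit=1000):
    """Return findings for a dict of counter name -> value.

    Both thresholds are illustrative assumptions; the guide only says
    "low" efficiency and "large numbers" of error rows.
    """
    findings = []

    # Buffer efficiency matters when it is low for ALL sources and targets.
    efficiencies = [value for name, value in counters.items()
                    if name.lower().endswith("_efficiency")]
    if efficiencies and all(v < low_efficiency_pct for v in efficiencies):
        findings.append("buffer efficiency low everywhere: "
                        "try a larger DTM buffer pool")

    for name in sorted(counters):
        value = counters[name]
        lowered = name.lower()
        if lowered.endswith(("_readfromdisk", "_writetodisk")) and value != 0:
            # Any non-zero value means the cache spilled to local disk.
            findings.append(f"{name}={value}: try larger index and data caches")
        elif lowered.endswith("_errorrows") and value > error_row_limit:
            findings.append(f"{name}={value}: fix recurring transformation errors")
    return findings
```

For example, a session whose Aggregator shows a non-zero readfromdisk counter would be flagged for larger caches, while a handful of error rows below the assumed limit would not be reported.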

Advanced Client Configuration Options

Challenge

Setting the Registry in order to ensure consistent client installations, resolve potential missing or invalid license key issues, and change the Server Manager Session Log Editor to your preferred editor.

Description

Ensuring Consistent Data Source Names

To ensure the use of consistent data source names for the same data sources across the domain, the Administrator can create a single "official" set of data sources, then use the Repository Manager to export that connection information to a file. You can then distribute this file and import the connection information for each client machine.

Solution

• From Repository Manager, choose Export Registry from the Tools drop down menu.
• For all subsequent client installs, simply choose Import Registry from the Tools drop down menu.

Resolving the Missing or Invalid License Key Issue

The “missing or invalid license key” error occurs when attempting to install PowerCenter Client tools on NT 4.0 or Windows 2000 with a userid other than ‘Administrator.’ This problem also occurs when the client software tools are installed under the Administrator account, and subsequently a user with a non-administrator ID attempts to run the tools. The user who attempts to log in using the normal ‘nonadministrator’ userid will be unable to start the PowerCenter Client tools. Instead, the software will display the message indicating that the license key is missing or invalid.

Solution

• While logged in as the installation user with administrator authority, use regedt32 to edit the registry.
• Under HKEY_LOCAL_MACHINE open Software/Informatica/PowerMart Client Tools/. From the menu bar, select Security/Permissions, and grant read access to the users that should be permitted to use the PowerMart Client. (Note that the registry entries for both PowerMart and PowerCenter server and client tools are stored as PowerMart Server and PowerMart Client tools.)

Changing the Server Manager Session Log Editor

The session log editor is not automatically determined when the PowerCenter Client tools are installed. A window appears the first time a session log is viewed from the PowerCenter Server Manager, prompting the user to enter the full path name of the editor to be used to view the logs. Users often set this parameter incorrectly and must access the registry to change it.

Solution

• While logged in as the installation user with administrator authority, use regedt32 to go into the registry.
• Move to registry path location: HKEY_CURRENT_USER Software\Informatica\PowerMart Client Tools\[CLIENT VERSION]\Server Manager\Session Files. From the menu bar, select View Tree and Data.
• Select the Log File Editor entry by double clicking on it.
• Replace the entry with the appropriate editor entry, i.e., typically WordPad.exe or Write.exe.
• Select Registry --> Exit from the menu bar to save the entry.

Advanced Server Configuration Options

Challenge

Configuring the Throttle Reader and File Debugging options, adjusting semaphore settings in the UNIX environment, and configuring server variables.

Description

Configuring the Throttle Reader

If problems occur when running sessions, some adjustments at the Server level can help to alleviate issues or isolate problems. One technique that often helps resolve “hanging” sessions is to limit the number of reader buffers that use Throttle Reader. This is particularly effective if your mapping contains many target tables, or if the session employs constraint-based loading. This parameter closely manages buffer blocks in memory by restricting the number of blocks that can be utilized by the Reader.

Note for PowerCenter 5.x and above ONLY: If a session is hanging and it is partitioned, it is best to remove the partitions before adjusting the throttle reader. When a session is partitioned, the server makes separate connections to the source and target for every partition. This will cause the server to manage many buffer blocks. If the session still hangs, try adjusting the throttle reader.

Solution: To limit the number of reader buffers using Throttle Reader in NT/2000:

• Access file hkey_local_machine\system\currentcontrolset\services\powermart\parameters\miscinfo.
• Create a new String value with value name of 'ThrottleReader' and value data of '10'.

To do the same thing in UNIX:

• Add this line to .cfg file:
  ThrottleReader=10

Configuring File Debugging Options

If problems occur when running sessions or if the PowerCenter Server has a stability issue, help technical support to resolve the issue by supplying them with Debug files.

To set the debug options on for NT/2000:

1. Select Start, Run, and type “regedit”.
2. Go to hkey_local_machine, system, current_control_set, services, powermart, miscInfo.
3. Select edit, then add value.
4. Place "DebugScrubber" as the value then hit OK. Insert "4" as the value.
5. Repeat steps 3 and 4, but use "DebugWriter", "DebugReader", "DebugDTM" with all three set to "1".

To do the same in UNIX:

Insert the following entries in the pmserver.cfg file:

• DebugScrubber=4
• DebugWriter=1
• DebugReader=1
• DebugDTM=1

Adjusting Semaphore Settings

The UNIX version of the PowerCenter Server uses operating system semaphores for synchronization. You may need to increase these semaphore settings before installing the server.

The number of semaphores required to run a session is 7. Most installations require between 64 and 128 available semaphores, depending on the number of sessions the server runs concurrently. This is in addition to any semaphores required by other software, such as database servers.

The total number of available operating system semaphores is an operating system configuration parameter, with a limit per user and system. The method used to change the parameter depends on the operating system:

• HP/UX: Use sam (1M) to change the parameters.
• Solaris: Use admintool or edit /etc/system to change the parameters.
• AIX: Use smit to change the parameters.

Setting Shared Memory and Semaphore Parameters

Informatica recommends setting the following parameters as high as possible for the operating system. However, if you set these parameters too high, the machine may not boot. Refer to the operating system documentation for parameter limits:

Parameter   Recommended Value for Solaris   Description
SHMMAX      4294967295                      Maximum size in bytes of a shared memory segment.
SHMMIN      1                               Minimum size in bytes of a shared memory segment.
SHMMNI      100                             Number of shared memory identifiers.
SHMSEG      10                              Maximum number of shared memory segments that can be attached by a process.
SEMMNS      200                             Number of semaphores in the system.
SEMMNI      70                              Number of semaphore set identifiers in the system. SEMMNI determines the number of semaphores that can be created at any one time.
SEMMSL      Equal to or greater than the value of the PROCESSES initialization parameter
                                            Maximum number of semaphores in one semaphore set. Must be equal to the maximum number of processes.

For example, you might add the following lines to the Solaris /etc/system file to configure the UNIX kernel:

set shmsys:shminfo_shmmax = 4294967295
set shmsys:shminfo_shmmin = 1
set shmsys:shminfo_shmmni = 100
set shmsys:shminfo_shmseg = 10
set semsys:seminfo_semmns = 200
set semsys:seminfo_semmni = 70

Always reboot the system after configuring the UNIX kernel.

Configuring Server Variables

One configuration best practice is to properly configure and leverage Server variables. Benefits of using server variables:

• Ease of deployment from development environment to production environment.
• Ease of switching sessions from one server machine to another without manually editing all the sessions to change directory paths.

• All the variables are related to directory paths used by server.

Approach

In Server Manager, edit the server configuration to set or change the variables. Each registered server has its own set of variables. The list is fixed, not user-extensible.

Server Variable             Value
$PMRootDir                  (no default – user must insert a path)
$PMSessionLogDir            $PMRootDir/SessLogs
$PMBadFileDir               $PMRootDir/BadFiles
$PMCacheDir                 $PMRootDir/Cache
$PMTargetFileDir            $PMRootDir/TargetFiles
$PMSourceFileDir            $PMRootDir/SourceFiles
$PMExtProcDir               $PMRootDir/ExtProc
$PMSuccessEmailUser         (no default – user must insert a path)
$PMFailureEmailUser         (no default – user must insert a path)
$PMSessionLogCount          0
$PMSessionErrorThreshold    0

Where are these variables referenced?

• Server manager session editor: anywhere in the fields for session log directory, bad file directory, etc.
• Designer: Aggregator/Rank/Joiner attribute for ‘Cache Directory’, External Procedure attribute for ‘Location’.

Does every session and mapping have to use these variables (are they mandatory)?

• No. If you remove any variable reference from the session or the widget attributes then the server does not use that variable.

What if a variable is not referenced in the session or mapping?

• The variable is just a convenience; the user can choose to use it or not. The variable will be expanded only if it is explicitly referenced from another location. If the session log directory is specified as $PMSessionLogDir, then the logs are put in that location. Note that this location may be different on every server. This is in fact a primary purpose for utilizing variables. But if the session log directory field is changed to designate a specific location, e.g. ‘/home/john/logs’, then the session logs will instead be placed in the directory location as designated. (The variable $PMSessionLogDir will be unused so it does not matter what the value of the variable is set to.)
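The way a path field resolves through the server variables can be illustrated with a small Python sketch. This only models the behaviour described above and is not how the server is implemented. The `/opt/informatica` root is a hypothetical value (the variable has no default), and `expand` is an illustrative helper.

```python
import re

# Defaults from the table above; $PMRootDir has no default, so the value
# used here is a hypothetical example path.
SERVER_VARIABLES = {
    "$PMRootDir": "/opt/informatica",
    "$PMSessionLogDir": "$PMRootDir/SessLogs",
    "$PMBadFileDir": "$PMRootDir/BadFiles",
    "$PMCacheDir": "$PMRootDir/Cache",
    "$PMTargetFileDir": "$PMRootDir/TargetFiles",
    "$PMSourceFileDir": "$PMRootDir/SourceFiles",
    "$PMExtProcDir": "$PMRootDir/ExtProc",
}

_VARIABLE = re.compile(r"\$PM[A-Za-z]+")


def expand(value, variables, _depth=0):
    """Replace $PM... references in value, following nested references."""
    if _depth > 10:
        raise ValueError("server variables reference each other in a loop")

    def substitute(match):
        name = match.group(0)
        if name not in variables:
            return name  # leave unknown names untouched
        return expand(variables[name], variables, _depth + 1)

    return _VARIABLE.sub(substitute, value)
```

A field that references `$PMSessionLogDir` follows the server's root directory, so the same session resolves differently on each registered server. A hard-coded field such as `/home/john/logs` passes through unchanged, which is the case described in the last answer above.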


Platform Sizing

Challenge

Determining the appropriate platform size to support PowerCenter, considering specific environmental and processing requirements. Sizing may not be an easy task because it may be necessary to configure a single server to support numerous applications.

Description

This Best Practice provides general guidance for sizing computing environments. It also discusses some potential questions and pitfalls that may arise when migrating to Production.

Certain terms used within this Best Practice are specific to Informatica’s PowerCenter. Please consult the appropriate PowerCenter manuals for explanation of these terms where necessary.

Environmental configurations may vary greatly with regard to hardware and software sizing. In addition to requirements for PowerCenter, other applications may share the server. Be sure to consider all mandatory server software components, including the operating system and all of its components, the database engine, front-end engines, etc. Regardless of whether or not the server is shared, it will be necessary to research the requirements of these additional software components when estimating the size of the overall environment.

Technical Information

Before delving into key sizing questions, let us review the PowerCenter engine and its associated resource needs. Each session:

• Represents an active task that performs data loading. PowerCenter provides session parameters that can be set to specify the amount of required shared memory per session. This shared memory setting is important, as it will dictate the amount of RAM required when running concurrent sessions, and will also be used to provide a level of performance that meets your needs.
• Uses up to 140% of CPU resources. This is important to remember if sessions will be executed concurrently.

• Requires 20-30 MB of memory per session if there are no aggregations, lookups, or heterogeneous data joins contained within the mapping.
• May require additional memory for the caching of aggregations, lookups, or joins, because:
  • Lookup tables, when cached in full, result in memory consumption commensurate with the size of the tables involved.
  • Aggregate caches store the individual groups; more memory is used if there are more groups.
  • In a join, cache the master table; memory consumed depends on the size of the master.

Note: Sorting the input to aggregations will greatly reduce the need for memory.

The PowerCenter engine:

• Requires 20-30 MB of memory for the main server engine for session coordination.
• Requires additional memory when caching for aggregation, lookups, or joins.

Disk space is not a factor if the machine is dedicated exclusively to the server engine. However, if the following conditions exist, disk space will need to be carefully considered:

• Data is staged to flat files on the PowerCenter server.
• Data is stored in incremental aggregation files for adding data to aggregates. The space consumed is about the size of the data aggregated.
• Temporary space is not used like a database on disk, unless the cache requires it after filling system memory.
• Data does not need to be striped to prevent head contention. This includes all types of data such as flat files and database tables.

Key Questions

The goal of this analysis is to size the machine so that the ETL processes can complete within the specified load window. Consider the following questions when estimating the required number of sessions, the volume of data moved per session, and the caching requirements for the session’s lookup tables, aggregation, and heterogeneous joins. Use these estimates along with recommendations in the preceding Technical Information section to determine the required number of processors, memory, and disk space to achieve the required performance to meet the load window.

Note: It may be helpful to refer to the Performance Tuning section in Phase 4 of the Informatica Methodology when determining memory settings. The Performance Tuning section provides additional information on factors that typically affect session performance, and offers general guidance for estimating session resources. The amount of memory can be calculated per session. Refer to the Session and Server guide to determine the exact amount of memory necessary per session.
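As a rough illustration of how the per-session and engine figures combine, the following Python sketch computes a low and high memory estimate. It assumes the 20-30 MB ranges quoted above, treats cache memory as a single per-session figure supplied by the user, and ignores the operating system, database engine, and other software. The function name `estimate_memory_mb` is hypothetical.

```python
def estimate_memory_mb(concurrent_sessions, cache_mb_per_session=0.0,
                       per_session_mb=(20.0, 30.0), engine_mb=(20.0, 30.0)):
    """Return a (low, high) memory estimate in MB for the PowerCenter engine.

    cache_mb_per_session is the extra memory each session needs for lookup,
    aggregation, or join caches; it must be estimated from the data involved.
    """
    if concurrent_sessions < 0:
        raise ValueError("concurrent_sessions must not be negative")
    low = engine_mb[0] + concurrent_sessions * (per_session_mb[0] + cache_mb_per_session)
    high = engine_mb[1] + concurrent_sessions * (per_session_mb[1] + cache_mb_per_session)
    return low, high
```

With 22 concurrent sessions and no caching, `estimate_memory_mb(22)` gives a range of 460 to 690 MB. Adding 100 MB of cache per session raises it to 2660 to 2890 MB, which shows how quickly caches come to dominate the estimate.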

Please note that the hardware sizing analysis is highly dependent on the environment in which the server is deployed. It is very important to understand the performance characteristics of the environment before making any sizing conclusions.

It is vitally important to remember that in addition to PowerCenter, other applications may be vying for server resources. PowerCenter commonly runs on a server that also hosts a database engine plus query/analysis tools. In an environment where PowerCenter runs in parallel with all of these tools, the query/analysis tool often drives the hardware requirements. However, if the ETL processing is performed after business hours, the query/analysis tool requirements may not impose a sizing limitation.

With these additional processing requirements in mind, consider platform size in light of the following questions:

• What sources are accessed by the mappings?
• How do you currently access those sources?
• Do the sources reside locally, or will they be accessed via a network connection?
• What kind of network connection exists?
• Have you decided on the target environment (database/hardware/operating system)? If so, what is it?
• Have you decided on the PowerCenter server environment (hardware/operating system)?
• Is it possible for the PowerCenter server to be on the same machine as the target?
• How will information be accessed for reporting purposes (e.g., cube, ad-hoc query tool, etc.) and what tools will you use to implement this access?
• What other applications or services, if any, run on the PowerCenter server?
• Has the database table space been distributed across controllers, where possible, to maximize throughput by reading and writing data in parallel?

When considering the server engine size, answer the following questions:

• Are there currently extract, transform, and load processes in place? If so, what are the processes, and how long do they take?
• What is the total volume of data that must be moved, in bytes?
• What is the largest table (bytes and rows)?
• Is there any key on this table that could be used to partition load sessions, if necessary?
• How will the data be moved, via flat file processing or relational tables?
• What is the load strategy: is the data updated, incrementally loaded, or will all tables be truncated and reloaded?
• Will the data processing require staging areas?
• What is the load plan? Are there dependencies between facts and dimensions?
• How often will the data be refreshed?
• Will the refresh be scheduled at a certain time, or driven by external events?
• Is there a "modified" timestamp on the source table rows, enabling incremental load strategies?
• What is the size of the batch window that is available for the load?
• Does the load process populate detail data, aggregations, or both?
• If data is being aggregated, what is the ratio of source/target rows for the largest result set? How large is the result set (bytes and rows)?

The answers to these questions will provide insight into the factors that impact PowerCenter's resource requirements. To simplify the analysis, focus on large, "critical path" jobs that drive the resource requirement.

A Sample Performance Result

The following is a testimonial from a customer configuration. Please note that these performance tests were run on a previous version of PowerCenter, which did not include the performance and functional enhancements in release 5.1. These results are offered as one example of throughput. However, results will definitely vary by installation because each environment has a unique architecture and unique data characteristics.

The performance tests were performed on a 4-processor Sun E4500 with 2GB of memory. This processor handled just under 20.5 million rows, and more than 2.8GB of data, in less than 54 minutes. In this test scenario, 22 sessions ran in parallel, populating a large product sales table. Four sessions ran after the set of 22, populating various summarization tables based on the product sales table. All of the mappings were complex, joining several sources and utilizing several Expression, Lookup and Aggregation transformations. The source and target database used in the tests was Oracle. The source and target were both hosted locally on the ETL Server.

Links

The following link may prove helpful when determining the platform size: www.tpc.org. This website contains benchmarking reports that will help you fine tune your environment and may assist in determining processing power required.
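The sample result above can be turned into an approximate throughput figure for comparison against a load window. The sketch below is a back-of-the-envelope calculation only. It treats the quoted bounds ("just under", "more than", "less than") as exact values and assumes 1 GB = 1024 MB. The function names are hypothetical.

```python
def throughput(rows, gigabytes, minutes):
    """Return (rows per second, MB per second) for a completed load."""
    seconds = minutes * 60
    return rows / seconds, gigabytes * 1024 / seconds


def minutes_needed(rows, rows_per_second):
    """Estimate how long a load of the given size takes at a known rate."""
    return rows / rows_per_second / 60


# Figures from the sample result, treated as exact for illustration.
sample_rows_per_s, sample_mb_per_s = throughput(20_500_000, 2.8, 54)
```

On these figures the sample configuration moved roughly 6,300 rows per second in aggregate across all sessions. As the text notes, results will vary by installation, so such a number is a sanity check rather than a sizing guarantee.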

Running Sessions in Recovery Mode

Challenge

Use PowerCenter standard functionality to recover data that is committed to a session's targets, even if the session does not complete.

Description

When a network or other problem causes a session whose source contains a million rows to fail after only half of the rows are committed to the target, one option is to truncate the target and run the session again from the beginning. But that is not the only option. Rather than processing the first half of the source again, you can tell the server to keep data already committed to the target database and process the rest of the source. This results in accurate and complete target data, as if the session completed successfully with one run. This technique is called performing recovery.

When you run a session in recovery mode, the server notes the row id of the last row committed to the target database. The server then reads all sources again, but only processes from the subsequent row id. For example, if the server commits 1000 rows before the session fails, when you run the session in recovery mode, the server reads all source tables, and then passes data to the Data Transformation Manager (DTM) starting from row 1001.

When necessary, the server can recover the same session more than once. That is, if a session fails while running in recovery mode, you can re-run the session in recovery mode until the session completes successfully. This is called nested recovery.

The server can recover committed target data if the following three criteria are met:

• All session targets are relational. The server can only perform recovery on relational tables. If the session has file targets, the server cannot perform recovery. If a session writing to file targets fails, delete the files, and run the session again.
• The session is configured for a normal (not bulk) target load. The server uses database logging to perform recovery. Since bulk loading bypasses database logging, the server cannot recover sessions configured to bulk load targets. Although recovering a large session can be more efficient than running the session again, bulk loading increases general session performance. When configuring session properties for sessions processing large amounts of data, weigh the importance of performing recovery when choosing a target load type. When you configure a session to load in bulk, the server logs a message in the session log stating that recovery is not supported.
• The server configuration parameter Disable Recovery is not selected. When the Disable Recovery option is checked, the server does not create the OPB_SRVR_RECOVERY table in the target database to store recovery-related information. If the table already exists, the server does not write information to that table.

In addition, to ensure accurate results from the recovery, the following must be true:

• Source data does not change before performing recovery. This includes inserting, updating, and deleting source data. Changes in source files or tables can result in inaccurate data.
• The mapping used in the session does not use a Sequence Generator or Normalizer. Both the Sequence Generator and the Normalizer transformations generate source values: the Sequence Generator generates sequences, and the Normalizer generates primary keys. Therefore, sessions using these transformations are not guaranteed to return the same values when performing recovery.

Session Logs

If a session is configured to archive session logs, the server creates a new session log for the recovery session. If you perform nested recovery, the server creates a new log for each session run. If the session is not configured to archive session logs, the server overwrites the existing log when you recover the session.

Reject Files

When performing recovery, the server creates a single reject file. The server appends rejected rows from the recovery session (or sessions) to the session reject file. This allows you to correct and load all rejected rows from the completed session.

Example

Session “s_recovery” reads from a Sybase source and writes to a target table in “production_target”, a Microsoft SQL Server database. This session is configured for a normal load. The mapping consists of:

Source Qualifier: SQ_LINEITEM
Expression transformation: EXP_TRANS
Target: T_LINEITEM

The session is configured to save 5 session logs.

First Run
The first time the session runs, the server creates a session log named s_recovery.log. (If the session is configured to save logs by timestamp, the server appends the date and time to the log file name.) The server also creates a reject file for the target table named t_lineitem.bad.

The following section of the session log shows the server preparing to load normally to the production_target database. Since the server cannot find OPB_SRVR_RECOVERY, it creates the table.

CMN_1053 Writer: Target is database [TOMDB@PRODUCTION_TARGET], user [lchen], bulk mode [OFF]
...
CMN_1039 SQL Server Event
CMN_1039 [01/14/99 18:42:44 SQL Server Message 208 : Invalid object name 'OPB_SRVR_RECOVERY'.]
Thu Jan 14 18:42:44 1999
CMN_1040 SQL Server Event
CMN_1040 [01/14/99 18:42:44 DB-Library Error 10007 : General SQL Server error: Check messages from the SQL Server.]
Thu Jan 14 18:42:44 1999
CMN_1022 Database driver error...
CMN_1022 [Function Name : Execute SqlStmt : SELECT SESSION_ID FROM OPB_SRVR_RECOVERY]
WRT_8017 Created OPB_SRVR_RECOVERY table in target database.

As the following session log shows, the server performs six target-based commits before the session fails.

TM_6095 Starting Transformation Engine...
Start loading table [T_LINEITEM] at: Thu Jan 14 18:42:50 1999
TARGET BASED COMMIT POINT Thu Jan 14 18:43:59 1999
=============================================
Table: T_LINEITEM
Rows Output: 10125
Rows Applied: 10125
Rows Rejected: 0
TARGET BASED COMMIT POINT Thu Jan 14 18:45:09 1999
=============================================
Table: T_LINEITEM
Rows Output: 20250
Rows Applied: 20250
Rows Rejected: 0
TARGET BASED COMMIT POINT Thu Jan 14 18:46:25 1999
=============================================
Table: T_LINEITEM
Rows Output: 30375
Rows Applied: 30375
Rows Rejected: 0
TARGET BASED COMMIT POINT Thu Jan 14 18:47:31 1999
=============================================
Table: T_LINEITEM
Rows Output: 40500
Rows Applied: 40500
Rows Rejected: 0
TARGET BASED COMMIT POINT Thu Jan 14 18:48:35 1999
=============================================
Table: T_LINEITEM
Rows Output: 50625
Rows Applied: 50625
Rows Rejected: 0
TARGET BASED COMMIT POINT Thu Jan 14 18:49:41 1999
=============================================
Table: T_LINEITEM
Rows Output: 60750
Rows Applied: 60750
Rows Rejected: 0

When a session fails, you can truncate the target and run the entire session again. However, since the server committed more than 60,000 rows to the target, rather than running the whole session again, you can configure the session to recover the committed rows.

Running a Recovery Session
To run a recovery session, check the Perform Recovery option on the Log Files tab of the session property sheet. To archive the existing session log, either increase the number of session logs saved, or choose the Save Session Log By Timestamp option on the Log Files tab. Start the session, or if necessary, edit the session schedule and reschedule the session.

Second Run (Recovery Session)
When you run the session in recovery mode, the server creates a new session log. Since the session is configured to save multiple logs, it renames the existing log s_recovery.log.0, and writes all new session information in s_recovery.log. The server reopens the existing reject file (t_lineitem.bad) and appends any rejected rows to that file.

When performing recovery, the server reads the source, and then passes data to the DTM beginning with the first uncommitted row. In the session log below, the server notes the session is in recovery mode, and states the row at which it will begin recovery (i.e., row 60751).

TM_6098 Session [s_recovery] running in recovery mode.
…
TM_6026 Recovering from row [60751] for target instance [T_LINEITEM].

When running the session with the Verbose Data tracing level, the server provides more detailed information about the session. As seen below, the server sets row 60751 as the row from which to recover. It opens the existing reject file and begins processing with the next row, 60752.

Note: Setting the tracing level to Verbose Data slows the server's performance and is not recommended for most production sessions.

CMN_1053 SetRecoveryInfo for transform(T_LINEITEM): Rows To Recover From = [60751]:
CMN_1053 Current Transform [SQ_lineitem]: Rows To Consume From = [60751]:
CMN_1053 Output Transform [EXPTRANS]: Rows To Produce From = [60751]:
CMN_1053 Current Transform [EXPTRANS]: Rows To Consume From = [60751]:
CMN_1053 Output Transform [T_LINEITEM]: Rows To Produce From = [60751]:
CMN_1053 Writer: Opened bad (reject) file [C:\winnt\system32\BadFiles\t_lineitem.bad]

Third Run (Nested Recovery)
If the recovery session fails before completing, you can run the session in recovery mode again. The server runs the session as it did the earlier recovery sessions, creating a new session log and appending bad data to the reject file. You can run the session in recovery mode as many times as necessary to complete the session's target tables. When the server completes loading target tables, it performs any configured post-session stored procedures or commands normally, as if the session completed in a single run.

Returning to Normal Session
After successfully recovering a session, you must edit the session properties to clear the Perform Recovery option. If necessary, return the session to its normal schedule and reschedule the session.

Things to Consider
In PowerCenter 5.1, the DisableRecovery server initialization flag defaults to Yes. This means the OPB_SRVR_RECOVERY table will not be created, and 'Perform Recovery' will not be possible unless this flag is changed to No during server configuration. You will need to have "create table" permissions in the target database in order to create this table.
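The relationship between the last commit point and the recovery row in the example session (60750 rows committed, recovery from row 60751) can be checked with a short script. This is only a sketch: it assumes a log that contains "Rows Applied:" entries in the layout quoted in the example and that no rows were rejected, and the function name next_recovery_row is hypothetical rather than part of PowerCenter.

```python
import re

# Excerpt modelled on the last two commit-point entries in the example session log.
SAMPLE_LOG = """\
TARGET BASED COMMIT POINT Thu Jan 14 18:48:35 1999
Table: T_LINEITEM
Rows Output: 50625
Rows Applied: 50625
Rows Rejected: 0
TARGET BASED COMMIT POINT Thu Jan 14 18:49:41 1999
Table: T_LINEITEM
Rows Output: 60750
Rows Applied: 60750
Rows Rejected: 0
"""


def next_recovery_row(log_text: str) -> int:
    """Return the first uncommitted row: the last 'Rows Applied' count plus one.

    Hypothetical helper for illustration; it is not a PowerCenter API.
    """
    applied = [int(n) for n in re.findall(r"Rows Applied:\s*(\d+)", log_text)]
    if not applied:
        return 1  # no commit point found, so nothing has been committed yet
    return applied[-1] + 1


print(next_recovery_row(SAMPLE_LOG))
```

For the excerpt above the result matches the row reported by message TM_6026 in the recovery log.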

Developing the Business Case

Challenge
Identifying the departments and individuals that are likely to benefit directly from the project implementation. Understanding these individuals, and their business information requirements, is key to defining and scoping the project.

Description
The following four steps summarize business case development and lay a good foundation for proceeding into detailed business requirements for the project.

1. One of the first steps in establishing the business scope is identifying the project beneficiaries and understanding their business roles and project participation. In many cases, the Project Sponsor can help to identify the beneficiaries and the various departments they represent. This information can then be summarized in an organization chart that is useful for ensuring that all project team members understand the corporate/business organization.

• Activity - Interview project sponsor to identify beneficiaries, define their business roles and project participation.
• Deliverable - Organization chart of corporate beneficiaries and participants.

2. The next step in establishing the business scope is to understand the business problem or need that the project addresses. This information should be clearly defined in a Problem/Needs Statement, using business terms to describe the problem. For example, the problem may be expressed as "a lack of information" rather than "a lack of technology" and should detail the business decisions or analysis that is required to resolve the lack of information. The best way to gather this type of information is by interviewing the Project Sponsor and/or the project beneficiaries.

• Activity - Interview (individually or in forum) Project Sponsor and/or beneficiaries regarding problems and needs related to project.
• Deliverable - Problem/Need Statement

3. The next step in creating the project scope is defining the business goals and objectives for the project and detailing them in a comprehensive Statement of Project Goals and Objectives. This statement should be a high-level expression of the desired business solution (e.g., what strategic or tactical benefits does the business expect to gain from the project) and should avoid any technical considerations at this point. Again, the Project Sponsor and beneficiaries are the best sources for this type of information. It may be practical to combine information gathering for the needs assessment and goals definition, using individual interviews or general meetings to elicit the information.

• Activity - Interview (individually or in forum) Project Sponsor and/or beneficiaries regarding business goals and objectives for the project.
• Deliverable - Statement of Project Goals and Objectives

4. The final step is creating a Project Scope and Assumptions statement that clearly defines the boundaries of the project based on the Statement of Project Goals and Objectives and the associated project assumptions. This statement should focus on the type of information or analysis that will be included in the project rather than what will not. The assumptions statements are optional and may include qualifiers on the scope, such as assumptions of feasibility, specific roles and responsibilities, or availability of resources or data.

• Activity - Business Analyst develops Project Scope and Assumptions statement for presentation to the Project Sponsor.
• Deliverable - Project Scope and Assumptions statement


Assessing the Business Case

Challenge
Developing a solid business case for the project that includes both the tangible and intangible potential benefits of the project.

Description
The Business Case should include both qualitative and quantitative assessments of the project. The Qualitative Assessment portion of the Business Case is based on the Statement of Problem/Need and the Statement of Project Goals and Objectives (both generated in Subtask 1.1.1) and focuses on discussions with the project beneficiaries of expected benefits in terms of problem alleviation, cost savings or controls, and increased efficiencies and opportunities. The Quantitative Assessment portion of the Business Case provides specific measurable details of the proposed project, such as the estimated ROI, which may involve the following calculations:

Cash flow analysis - Projects positive and negative cash flows for the anticipated life of the project. Typically, ROI measurements use the cash flow formula to depict results.

Net present value - Evaluates cash flow according to the long-term value of current investment. Net present value shows how much capital needs to be invested currently, at an assumed interest rate, in order to create a stream of payments over time. For instance, to generate an income stream of $500 per month over six months at an interest rate of eight percent per payment period would require an investment (a net present value) of $2,311.44.

Return on investment - Calculates net present value of total incremental cost savings and revenue divided by the net present value of total costs multiplied by 100. This type of ROI calculation is frequently referred to as return on equity or return on capital employed.

Payback - Determines how much time will pass before an initial capital investment is recovered.
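The net present value figure quoted above can be reproduced by discounting each of the six $500 payments at eight percent per payment period and summing the results. The sketch below is illustrative; the function name present_value is an assumption, and the payments are assumed to arrive at the end of each period.

```python
def present_value(payment: float, rate_per_period: float, periods: int) -> float:
    """Discount a level stream of end-of-period payments back to today."""
    return sum(payment / (1 + rate_per_period) ** t for t in range(1, periods + 1))


# Six payments of $500, discounted at 8% per payment period.
pv = present_value(500.0, 0.08, 6)
print(f"{pv:.2f}")  # prints 2311.44
```

Note that the figure only matches when the eight percent is applied per payment period; an eight percent annual rate converted to a monthly rate would give a noticeably larger present value.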

The following are steps to calculate the quantitative business case or ROI:


Step 1. Develop Enterprise Deployment Map. This is a model of the project phases over a timeline, estimating as specifically as possible customer participation (e.g., by department and location), subject area and type of information/analysis, numbers of users, numbers of data marts and data sources, types of sources, and size of data set.

Step 2. Analyze Potential Benefits. Discussions with representative managers and users or the Project Sponsor should reveal the tangible and intangible benefits of the project. The most effective format for presenting this analysis is often a "before" and "after" format that compares the current situation to the project expectations.

Step 3. Calculate Net Present Value for all Benefits. Information gathered in this step should help the customer representatives to understand how the expected benefits will be allocated throughout the organization over time, using the enterprise deployment map as a guide.

Step 4. Define Overall Costs. Customers need specific cost information in order to assess the dollar impact of the project. Cost estimates should address the following fundamental cost components:

• Hardware
• Networks
• RDBMS software
• Back-end tools
• Query/reporting tools
• Internal labor
• External labor
• Ongoing support
• Training

Step 5. Calculate Net Present Value for all Costs. Use either actual cost estimates or percentage-of-cost values (based on cost allocation assumptions) to calculate costs for each cost component, projected over the timeline of the enterprise deployment map. Actual cost estimates are more accurate than percentage-of-cost allocations, but much more time-consuming. The percentage-of-cost allocation process may be valuable for initial ROI snapshots until costs can be more clearly predicted.

Step 6. Assess Risk, Adjust Costs and Benefits Accordingly. Review potential risks to the project and make corresponding adjustments to the costs and/or benefits. Some of the major risks to consider are:

• Scope creep, which can be mitigated by thorough planning and tight project scope
• Integration complexity, which can be reduced by standardizing on vendors with integrated product sets or open architectures
• Architectural strategy that is inappropriate
• Other miscellaneous risks from management or end users who may withhold project support; from the entanglements of internal politics; and from technologies that don't function as promised

Step 7. Determine Overall ROI. When all other portions of the business case are complete, calculate the project's "bottom line". Determining the overall ROI is simply a matter of subtracting net present value of total costs from net present value of (total incremental revenue plus cost savings). For more detail on these steps, refer to the Informatica White Paper: 7 Steps to Calculating Data Warehousing ROI.
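Steps 3, 5, and 7, together with the return on investment and payback measures described earlier, can be sketched in a few lines. All cash-flow figures, the discount rate, and the function names below are hypothetical and serve only to show how the calculations fit together.

```python
from typing import Optional, Sequence


def npv(rate: float, flows: Sequence[float]) -> float:
    """Net present value of cash flows received at the end of periods 1..n."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(flows, start=1))


def payback_period(initial_investment: float,
                   net_inflows: Sequence[float]) -> Optional[int]:
    """First period in which cumulative inflows cover the initial investment."""
    cumulative = 0.0
    for period, inflow in enumerate(net_inflows, start=1):
        cumulative += inflow
        if cumulative >= initial_investment:
            return period
    return None  # the investment is not recovered within the horizon


# Hypothetical yearly figures for a three-year deployment map.
benefits = [100_000.0, 200_000.0, 300_000.0]  # incremental revenue plus cost savings
costs = [250_000.0, 50_000.0, 50_000.0]       # total project costs
rate = 0.08                                   # assumed discount rate per year

npv_benefits = npv(rate, benefits)            # Step 3
npv_costs = npv(rate, costs)                  # Step 5
net_value = npv_benefits - npv_costs          # Step 7: the "bottom line"
roi_percent = npv_benefits / npv_costs * 100  # return on investment as a percentage

print(f"Net value: {net_value:,.0f}  ROI: {roi_percent:.0f}%")
print("Payback period:", payback_period(250_000.0, benefits))
```

Risk adjustments from Step 6 would be applied to the benefits and costs lists before the net present values are computed.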


Defining and Prioritizing Requirements

Challenge
Defining and prioritizing business and functional requirements is often accomplished through a combination of interviews and facilitated meetings (i.e., workshops) between the Project Sponsor and beneficiaries and the Project Manager and Business Analyst.

Description
The following three steps are key for successfully defining and prioritizing requirements:

Step 1: Discovery
During individual (or small group) interviews with high-level management, there is often a focus and clarity of vision that may be hindered in large meetings or unavailable from lower-level management. On the other hand, detailed review of existing reports and current analysis from the company's "information providers" can fill in helpful details. As part of the initial "discovery" process, Informatica generally recommends several interviews at the Project Sponsor and/or upper management level and a few with those acquainted with current reporting and analysis processes. A few peer group forums can also be valuable. However, this part of the process must be focused and brief, or it can become unwieldy as much time can be expended trying to coordinate calendars between worthy forum participants. Set a time period and target list of participants with the Project Sponsor, but avoid lengthening the process if some participants aren't available. Questioning during these sessions should include the following:

• What are the target business functions, roles, and responsibilities?
• What are the key relevant business strategies, decisions, and processes (in brief)?
• What information is important to drive, support, and measure success for those strategies/processes? What key metrics? What dimensions for those metrics?
• What current reporting and analysis is applicable? Who provides it? How is it presented? How is it used?

Step 2: Validation and Prioritization
The Business Analyst, with the help of the Project Architect, documents the findings of the discovery process. The resulting Business Requirements Specification includes a matrix linking the specific business requirements to their functional requirements.


At this time also, the Architect develops the Information Requirements Specification in order to clearly represent the structure of the information requirements. This document, based on the business requirements findings, will facilitate discussion of informational details and provide the starting point for the target model definition.

The detailed business requirements and information requirements should be reviewed with the project beneficiaries and prioritized based on business need and the stated project objectives and scope.

Step 3: The Incremental Roadmap
Concurrent with the validation of the business requirements, the Architect begins the Functional Requirements Specification providing details on the technical requirements for the project. As general technical feasibility is compared to the prioritization from Step 2, the Project Manager, Business Analyst, and Architect develop consensus on a project "phasing" approach. Items of secondary priority and those with poor near-term feasibility are relegated to subsequent phases of the project. Thus, they develop a phased, or incremental, "roadmap" for the project (Project Roadmap). This is presented to the Project Sponsor for approval and becomes the first "Increment" or starting point for the Project Plan.

Developing a WBS

Challenge
Developing a comprehensive work breakdown structure that clearly depicts all of the various tasks and subtasks required to complete the project. Because project time and resource estimates are typically based on the Work Breakdown Structure (WBS), it is critical to develop a thorough, accurate WBS.

Description
A WBS is a tool for identifying and organizing the tasks that need to be completed in a project. The WBS serves as a starting point for both the project estimate and the project plan.

One challenge in developing a good WBS is obtaining the correct balance between enough detail and too much detail. The WBS shouldn't be a 'grocery list' of every minor detail in the project, but it does need to break the tasks down to a manageable level of detail. One general guideline is to keep task detail to a duration of at least a day.

It is also important to remember that the WBS is not necessarily a sequential document. Tasks in the hierarchy are often completed in parallel. At this stage of project planning, the goal is to list every task that must be completed; it is not necessary to determine the critical path for completing these tasks. For example, we may have multiple subtasks under a task (e.g., 4.3.1 through 4.3.7 under task 4.3). So, although subtasks 4.3.1 through 4.3.4 may have sequential requirements that force us to complete them in order, subtasks 4.3.5 through 4.3.7 can (and should) be completed in parallel if they do not have sequential requirements. However, it is important to remember that a task is not complete until all of its corresponding subtasks are completed, whether sequentially or in parallel. For example, the BUILD phase is not complete until tasks 4.1 through 4.7 are complete, but some work can (and should) begin for the DEPLOY phase long before the BUILD phase is complete.

The Project Plan provides a starting point for further development of the project WBS. This sample is a Microsoft Project file that has been "pre-loaded" with the Phases, Tasks, and Subtasks that make up the Informatica Methodology. The Project Manager can use this WBS as a starting point, but should review it carefully to ensure that it corresponds to the specific development effort, removing any steps that aren't relevant or adding steps as necessary. Many projects will require the addition of detailed steps to accurately represent the development effort.

If the Project Manager chooses not to use Microsoft Project, an Excel version of the Work Breakdown Structure is available. The phases, tasks, and subtasks can be exported from Excel into many other project management tools, simplifying the effort to develop the WBS.

After the WBS has been loaded into the selected project management tool and refined for the specific project needs, the Project Manager can begin to estimate the level of effort involved in completing each of the steps. When the estimate is complete, individual resources can be assigned and scheduled. The end result is the Project Plan. Refer to Developing and Maintaining the Project Plan for further information about the project plan.
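The rule that a task is not complete until all of its subtasks are complete, regardless of the order in which they were worked, maps naturally onto a tree. The sketch below is one possible representation, not part of the Informatica Methodology; the class name WbsItem and the sample task names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WbsItem:
    """One node of a work breakdown structure (phase, task, or subtask)."""
    code: str
    name: str
    done: bool = False  # only consulted for items that have no children
    children: List["WbsItem"] = field(default_factory=list)

    def is_complete(self) -> bool:
        """A leaf is complete when marked done; a parent when all children are."""
        if not self.children:
            return self.done
        return all(child.is_complete() for child in self.children)

    def open_items(self) -> List[str]:
        """Codes of all unfinished leaf items beneath this node."""
        if not self.children:
            return [] if self.done else [self.code]
        codes: List[str] = []
        for child in self.children:
            codes.extend(child.open_items())
        return codes


# Task 4.3 with subtasks 4.3.1 through 4.3.7; the first four are finished.
task = WbsItem("4.3", "Hypothetical task", children=[
    WbsItem(f"4.3.{i}", f"Hypothetical subtask {i}", done=(i <= 4))
    for i in range(1, 8)
])

print(task.is_complete())
print(task.open_items())
```

Because completion is derived from the leaves, the structure carries no ordering information, which mirrors the point that the WBS itself is not a sequential document.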

Developing and Maintaining the Project Plan

Challenge
Developing the first-pass of a project plan that incorporates all of the necessary components but which is sufficiently flexible to accept the inevitable changes.

Description
Use the following steps as a guide for developing the initial project plan:

• Define the project's major milestones based on the Project Scope.
• Break the milestones down into major tasks and activities. The Project Plan should be helpful as a starting point or for recommending tasks for inclusion.
• Continue the detail breakdown, if possible, to a level at which tasks are of about one to three days' duration. This level provides satisfactory detail to facilitate estimation and tracking. If the detail tasks are too broad in scope, estimates are much less likely to be accurate.
• Confer with technical personnel to review the task definitions and effort estimates (or even to help define them, if applicable).
• Establish the dependencies among tasks, where one task cannot be started until another is completed (or must start or complete concurrently with another).
• Define the resources based on the role definitions and estimated number of resources needed for each role.
• Assign resources to each task. If a resource will only be part-time on a task, indicate this in the plan.

At this point, especially when using Microsoft Project, it is advisable to create dependencies (i.e., predecessor relationships) between tasks assigned to the same resource in order to indicate the sequence of that person's activities. The initial definition of tasks and effort and the resulting schedule should be an exercise in pragmatic feasibility unfettered by concerns about ideal completion dates. In other words, be as realistic as possible in your initial estimations, even if the resulting schedule is likely to be a hard sell to company management. This initial schedule becomes a starting point. Expect to review and rework it, perhaps several times. Look for opportunities for parallel activities, perhaps adding resources, if necessary, to improve the schedule.
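Predecessor relationships of this kind form a dependency graph, and a valid working sequence is any ordering in which every task follows its predecessors. The sketch below uses the graphlib module from the Python standard library (Python 3.9 or later) as a stand-in for what a project management tool does internally; the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks, each mapped to the set of tasks that must finish first.
dependencies = {
    "Define milestones": set(),
    "Break down tasks": {"Define milestones"},
    "Review estimates": {"Break down tasks"},
    "Assign resources": {"Break down tasks"},
    "Build schedule": {"Review estimates", "Assign resources"},
}

# static_order() yields every task after all of its predecessors.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Tasks that share the same predecessors and do not depend on each other ("Review estimates" and "Assign resources" here) are candidates for parallel work. If the dependencies contain a loop, static_order() raises graphlib.CycleError, which is a useful signal that the plan is inconsistent.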


When a satisfactory initial plan is complete, review it with the Project Sponsor and discuss the assumptions, dependencies, assignments, milestone dates, and such. Expect to modify the plan as a result of this review.

Reviewing and Revising the Project Plan
Once the Project Sponsor and company managers agree to the initial plan, it becomes the basis for assigning tasks to individuals on the project team and for setting expectations regarding delivery dates. The planning activity then shifts to tracking tasks against the schedule and updating the plan based on status and changes to assumptions. One approach is to establish a baseline schedule (and budget, if applicable) and then track changes against it. With Microsoft Project, this involves creating a "Baseline" that remains static as changes are applied to the schedule. If company and project management do not require tracking against a baseline, simply maintain the plan through updates without a baseline.

Regular status reporting should include any changes to the schedule, beginning with team members' notification that dates for task completions are likely to change or have already been exceeded. These status report updates should trigger a regular plan update so that project management can track the effect on the overall schedule and budget. Be sure to evaluate any changes to scope (see 1.2.4 Manage Project and Scope Change Assessment), or changes in priority or approach, as they arise to determine if they impact the plan. It may be necessary to modify the plan if changes in scope or priority require rearranging task assignments or delivery sequences, or if they add new tasks or postpone existing ones.
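Tracking against a baseline amounts to keeping the original finish dates unchanged and comparing them with the current ones at each status update. The sketch below shows the idea with hypothetical task names and dates; it is not a description of how Microsoft Project stores its baseline.

```python
from datetime import date
from typing import Dict

# Hypothetical baseline and current finish dates per task.
baseline = {
    "Design mappings": date(2024, 3, 1),
    "Load validation": date(2024, 3, 15),
}
current = {
    "Design mappings": date(2024, 3, 4),
    "Load validation": date(2024, 3, 15),
}


def slippage_days(baseline: Dict[str, date], current: Dict[str, date]) -> Dict[str, int]:
    """Days by which each task's current finish date differs from its baseline."""
    return {
        task: (current[task] - baseline[task]).days
        for task in baseline
        if task in current
    }


# Only tasks that have slipped need to appear in the status report.
slipped = {task: days for task, days in slippage_days(baseline, current).items() if days > 0}
print(slipped)
```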


Managing the Project Lifecycle

Challenge
Providing a structure for on-going management throughout the project lifecycle.

Description
It is important to remember that the quality of a project can be directly correlated to the amount of review that occurs during its lifecycle.

Project Status and Plan Reviews
In addition to the initial project plan review with the Project Sponsor, schedule regular status meetings with the sponsor and project team to review status, issues, scope changes and schedule updates. Gather status, issues and schedule update information from the team one day before the status meeting in order to compile and distribute the Status Report.

Project Content Reviews
The Project Manager should coordinate, if not facilitate, reviews of requirements, plans and deliverables with company management, including business requirements reviews with business personnel and technical reviews with project technical personnel. Set a process in place beforehand to ensure appropriate personnel are invited, any relevant documents are distributed at least 24 hours in advance, and that reviews focus on questions and issues (rather than a laborious "reading of the code"). Reviews may include:

• Project scope and business case review
• Business requirements review
• Source analysis and business rules reviews
• Data architecture review
• Technical infrastructure review (hardware and software capacity and configuration planning)
• Data integration logic review (source to target mappings, cleansing and transformation logic, etc.)
• Source extraction process review
• Operations review (operations and maintenance of load sessions, etc.)
• Reviews of operations plan, QA plan, deployment and support plan


Change Management
Directly address and evaluate any changes to the planned project activities, priorities, or staffing as they arise, or are proposed, in terms of their impact on the project plan.

• Use the Scope Change Assessment to record the background problem or requirement and the recommended resolution that constitutes the potential scope change.
• Review each potential change with the technical team to assess its impact on the project, evaluating the effect in terms of schedule, budget, staffing requirements, and so forth.
• Present the Scope Change Assessment to the Project Sponsor for acceptance (with formal sign-off, if applicable). Discuss the assumptions involved in the impact estimate and any potential risks to the project.

The Project Manager should institute this type of change management process in response to any issue or request that appears to add or alter expected activities and has the potential to affect the plan. Even if there is no evident effect on the schedule, it is important to document these changes because they may affect project direction and it may become necessary, later in the project cycle, to justify these changes to management.

Issues Management
Any questions, problems, or issues that arise and are not immediately resolved should be tracked to ensure that someone is accountable for resolving them so that their effect can also be visible. Use the Issues Tracking template, or something similar, to track issues, their owner, and dates of entry and resolution as well as the details of the issue and of its solution. Significant or "showstopper" issues should also be mentioned on the status report.

Project Acceptance and Close
Rather than simply walking away from a project when it seems complete, there should be an explicit close procedure. For most projects this involves a meeting where the Project Sponsor and/or department managers acknowledge completion or sign a statement of satisfactory completion.

• Even for relatively short projects, use the Project Close Report to finalize the project with a final status report detailing:
  o What was accomplished
  o Any justification for tasks expected but not completed
  o Recommendations
• Prepare for the close by considering what the project team has learned about the environments, procedures, data integration design, data architecture, and other project plans. Formulate the recommendations based on issues or problems that need to be addressed. Succinctly describe each problem or recommendation and, if applicable, briefly describe a recommended approach.


Configuring Security

Challenge
Configuring a PowerCenter security scheme to prevent unauthorized access to mappings, folders, sessions, batches, repositories, and data, in order to ensure system integrity and data confidentiality.

Description
Configuring security is one of the most important components of building a Data Warehouse. Determining an optimal security configuration for a PowerCenter environment requires a thorough understanding of business requirements, data content, and end users' access requirements. Knowledge of PowerCenter's security facilities is also a prerequisite to security design. Before implementing security measures, it is imperative to answer the following basic questions:

• Who needs access to the Repository? What do they need the ability to do?
• Is a central administrator required? What permissions are appropriate for him/her?
• Is the central administrator responsible for designing and configuring the repository security? If not, has a security administrator been identified?
• What levels of permissions are appropriate for the developers? Do they need access to all the folders?
• Who needs to start sessions manually?
• Who is allowed to start and stop the Informatica Server?
• How will PowerCenter security be administered? Will it be the same as the database security scheme?
• Do we need to restrict access to Global Objects?

In most implementations, the administrator takes care of maintaining the Repository. There should be a limit to the number of administrator accounts for PowerCenter. While this is less important in a development/unit test environment, it is critical for protecting the production environment.

The following pages offer some answers to these questions and some suggestions for assigning user groups and access privileges. Security should be implemented with the goals of easy maintenance and scalability.

PowerCenter’s security approach is similar to database security environments. All security management is performed through the Repository Manager. The internal security enables multi-user development through management of users, groups, privileges, and folders. Every user ID must be assigned to one or more groups. These are PowerCenter users, not database users; all password information is encrypted and stored in the repository.

The Repository may be connected to sources/targets that contain sensitive information. The Server Manager provides another level of security for this purpose. It is used to assign read, write, and execute permissions for Global Objects. Global Objects include Database Connections, FTP Connections and External Loader Connections. Global Object permissions, in addition to privileges and permissions assigned using the Repository Manager, affect the ability to perform tasks in the Server Manager, the Repository Manager, and the command line program, pmcmd.

The Server Manager also offers an enhanced security option that allows you to specify a default set of privileges that applies restricted access controls for Global Objects. Only the owner of the Object or a Super User can manage permissions for a Global Object. Choosing the Enable Security option activates the following set of default privileges:

User           Default Global Object Permissions
Owner          Read, Write and Execute
Owner Group    Read and Execute
World          No Permissions

Enabling Enhanced Security does not lock the restricted access settings for Global Objects. This means that the permissions for Global Objects can be changed after enabling Enhanced Security.

Although privileges can be assigned to users or groups, privileges are commonly assigned to groups, with users then added to each group. This approach is simpler than assigning privileges on a user-by-user basis, since there are generally few groups and many users, and any user can belong to more than one group. The following table summarizes some possible privileges that may be granted:

Privilege                      Description
Session Operator               Can run any sessions or batches, regardless of folder level permissions.
Use Designer                   Can edit metadata in the Designer.
Browse Repository              Can browse repository contents through the Repository Manager.
Create Sessions and Batches    Can create, modify, and delete sessions and batches in Server Manager.
Administer Repository          Can create and modify folders.
Administer Server              Can configure connections on the server and stop the server through the Server Manager or the command-line interface.
Super User                     Can perform all tasks with the repository and the server.

The next table suggests a common set of initial groups and the privileges that may be associated with them:

Group            Description                                                                        Privileges
Developer        PowerCenter developers who are creating the mappings.                             Session Operator, Use Designer, Browse Repository, Create Sessions and Batches
End User         Business end users who run reports off of the data warehouse.                     Browse Repository
Operator         Operations department that runs and maintains the environment in production.      Session Operator, Administer Server, Browse Repository
Administrator    Data warehouse Administrators who maintain the entire warehouse environment.      Super User

Users with Administer Repository or Super User privileges may edit folder properties, which must identify a folder owner and group, and also determine whether the folder is shareable, meaning that shortcuts can be created pointing to objects within the folder. After a folder is flagged as shareable, this property cannot be changed. For each folder, privileges are set for the owner, group, and repository (i.e., any user). The following table details the three folder level privileges: Read, Write, and Execute:

Privilege    Description
Read         Can read, copy, and create shortcuts to repository objects in the folder. Users without read permissions cannot see the folder.
Write        Can edit metadata in the folder.
Execute      Can run sessions using mappings in the folder.

Allowing shortcuts enables other folders in the same repository to share objects such as source/target tables, transformations, and mappings. A recommended practice is to create only one shareable folder per repository, and to place all reusable objects within that shareable folder. When other folders create a shortcut from a shareable folder, that folder inherits the properties of the object, so changes to common logic or elements can be managed more efficiently, thereby enabling object reuse.
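The folder-level permission scheme described above (Read, Write, and Execute, each set separately for the owner, the folder's group, and the repository as a whole) can be pictured with a small model. The following Python sketch is purely illustrative and is not a PowerCenter API: the names Permission, Folder, permissions_for, and is_allowed are hypothetical, and the rule that the most specific matching level (owner, then group, then repository) decides the outcome is an assumption made for the example rather than something the text above states.

```python
from dataclasses import dataclass
from enum import Flag


class Permission(Flag):
    """The three folder-level permissions named in the text."""
    NONE = 0
    READ = 1
    WRITE = 2
    EXECUTE = 4


@dataclass
class Folder:
    """Hypothetical stand-in for a repository folder and its security settings."""
    name: str
    owner: str
    group: str
    owner_permissions: Permission
    group_permissions: Permission
    repository_permissions: Permission  # applies to any user


def permissions_for(folder: Folder, user: str, user_groups: set) -> Permission:
    """Return the permission set that applies to this user.

    Assumption for this sketch: the most specific level wins
    (owner first, then the folder's group, then the repository level).
    """
    if user == folder.owner:
        return folder.owner_permissions
    if folder.group in user_groups:
        return folder.group_permissions
    return folder.repository_permissions


def is_allowed(folder: Folder, user: str, user_groups: set, needed: Permission) -> bool:
    """True if the user holds the needed permission on the folder."""
    return needed in permissions_for(folder, user, user_groups)


# Example: the owner gets all three permissions, the folder's group gets
# Read and Write, and everyone else gets Read only.
shared = Folder(
    name="SHARED",
    owner="alice",
    group="Developer",
    owner_permissions=Permission.READ | Permission.WRITE | Permission.EXECUTE,
    group_permissions=Permission.READ | Permission.WRITE,
    repository_permissions=Permission.READ,
)
```

With this example folder, is_allowed(shared, "bob", {"Developer"}, Permission.WRITE) is true for a member of the folder's group, while a user outside the group can only read. The sketch ignores repository privileges such as Session Operator and Super User, which the text says apply regardless of folder level permissions.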

Users who own a folder or have Administer Repository or Super User privileges can edit folder properties to change the owner, the group assigned to the folder, the three levels of privileges, and the Allow Shortcuts option. Note that users with the Session Operator privilege can run sessions or batches, regardless of folder level permissions.

A folder owner should be allowed all three folder level permissions. However, members of the folder’s group may have only Read and Write permissions, or possibly all three levels, depending on the desired level of security. Repository privileges should be restricted to Read permissions only, if any at all.

You might also wish to add a group specific to each application if there are many application development tasks being performed within the same repository. For example, if you have two projects, ABC and XYZ, it may be appropriate to create a group for ABC developers and another for XYZ developers. This enables you to assign folder level security to the group and keep the two projects from accidentally working in folders that belong to the other project team. In this example, you may assign group level security for all of the ABC folders to the ABC group. In this way, only members of the ABC group can make changes to those folders.

Tight security is recommended in the production environment to ensure that the developers and other users do not accidentally make changes to production. Only a few people should have Administer Repository or Super User privileges, while everyone else should have the appropriate privileges within the folders they use.

Informatica recommends creating individual User IDs for all developers and administrators on the system rather than using a single shared ID. One of the most important reasons for this is session level locking. When a session is in use by a developer, it cannot be opened and modified by anyone but that user. Locks thus prevent repository corruption by preventing simultaneous uncoordinated updates. Also, if multiple individuals share a common login ID, it is difficult to identify which developer is making (or has made) changes to an object.
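The ABC/XYZ arrangement described above can be sketched in a few lines, under stated assumptions. In this illustrative Python model (not a PowerCenter API), each user belongs to a role group that carries repository privileges, taken from the suggested groups table, and to a project group that is assigned to that project's folders. The folder names, user names, and the functions effective_privileges and can_modify_folder are all hypothetical.

```python
# Repository privileges per role group, following the suggested groups table.
ROLE_PRIVILEGES = {
    "Developer": {"Session Operator", "Use Designer", "Browse Repository",
                  "Create Sessions and Batches"},
    "End User": {"Browse Repository"},
    "Operator": {"Session Operator", "Administer Server", "Browse Repository"},
    "Administrator": {"Super User"},
}

# Hypothetical project folders and the group assigned to each one.
FOLDER_GROUP = {
    "ABC_STAGING": "ABC",
    "ABC_MARTS": "ABC",
    "XYZ_STAGING": "XYZ",
}

# Hypothetical individual user IDs; a user can belong to more than one group.
USER_GROUPS = {
    "anna": {"Developer", "ABC"},
    "raj": {"Developer", "XYZ"},
    "olga": {"Operator"},
}


def effective_privileges(user: str) -> set:
    """Union of the privileges of every role group the user belongs to."""
    privileges = set()
    for group in USER_GROUPS[user]:
        privileges |= ROLE_PRIVILEGES.get(group, set())
    return privileges


def can_modify_folder(user: str, folder: str) -> bool:
    """Group-level folder security: only members of the folder's group may change it."""
    return FOLDER_GROUP[folder] in USER_GROUPS[user]
```

In this model, can_modify_folder("anna", "ABC_STAGING") is true and can_modify_folder("anna", "XYZ_STAGING") is false, which mirrors the goal of keeping the two project teams out of each other's folders. Because every user has an individual ID, the same structure also makes it possible to list who holds a sensitive privilege such as Super User.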
