Using Informatica to Load Teradata at Cisco

Using Informatica With Teradata Load Utilities

Cisco uses Informatica for data extracts and loads. Informatica can load data into Teradata databases either through an ODBC connection or by building and launching Teradata load utility scripts. The Teradata load utilities are designed to load massive amounts of data in a short amount of time, so loading through ODBC should be considered for very small tables only. This document discusses three Teradata utilities (Fastload, Multiload, and Tpump) and is valid through Informatica version 7.1.2.

Fastload: Fastload inserts large volumes of data very rapidly into Teradata tables. It can load one table from multiple input files. The biggest restriction with Fastload is that the table being loaded must be empty. This makes it useful for initial loads, or for tables that are emptied prior to scheduled loads, but it can't be used for incremental updates. Fastload will not load duplicate rows into a table, even if the table is created as a multiset table. Completely duplicate input rows don't cause errors; they are simply dropped during the load process. A table being fastloaded is not available to users for queries.

Multiload: Multiload supports insert, update, delete, and upsert operations for up to five target tables. It can apply conditional logic to determine which updates to apply. Its speed approaches that of Fastload. Multiload is limited to one input file. Tables being multiloaded are available for select access only.

Tpump: Tpump is generally used for low-volume maintenance of large tables, for near-real-time maintenance, or both. It does row-at-a-time processing using SQL, and is slower than Fastload and Multiload. A table being maintained by Tpump remains available for other updates while the Tpump job is running against it. Tpump does not support multiple input files.

When deciding which load utility to select, you must consider the volume of data, the frequency of the load, and what type of availability is needed for the table while it is being loaded. All three utilities provide some level of restartability following errors. The following table compares the features of the three load utilities.

11/07/2005


Feature                   Fastload       Multiload      Tpump
DDL Functions             Limited        All            All
DML Functions             Insert         Ins/Upd/Del    Ins/Upd/Del
Multiple DML              No             Yes            Yes
Multiple Tables           No             Yes            Yes
Multiple Sessions         Yes            Yes            Yes
Protocol Used             FASTLOAD       MULTILOAD      SQL
Conditional Expressions   No             Yes            Yes
Arithmetic Calculations   No             Yes            No
Data Conversion           1 per column   Yes            Yes
Error Files               Yes            Yes            Yes
Error Limits              Yes            Yes            Yes
User-written Routines     Yes            Yes            Yes
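
The restrictions described for the three utilities can be sketched as a simple filter. This is only an illustration of the selection reasoning, not part of Informatica or Teradata: the `LoadRequirements` class, its field names, and `candidate_utilities` are hypothetical, and volume and load frequency still have to be weighed by hand.

```python
from dataclasses import dataclass


@dataclass
class LoadRequirements:
    """Hypothetical description of one load job (not an Informatica object)."""
    target_is_empty: bool            # Fastload requires an empty target table
    needs_update_or_delete: bool     # Fastload supports inserts only
    multiple_input_files: bool       # only Fastload accepts several input files
    concurrent_updates_needed: bool  # only Tpump leaves the table open to other updates
    select_access_needed: bool       # a table being fastloaded cannot be queried


def candidate_utilities(req: LoadRequirements) -> list:
    """Return the utilities whose documented restrictions fit the job."""
    candidates = []

    fastload_ok = (
        req.target_is_empty
        and not req.needs_update_or_delete
        and not req.concurrent_updates_needed
        and not req.select_access_needed
    )
    if fastload_ok:
        candidates.append("fastload")

    # Multiload: one input file, target available for select access only.
    if not req.multiple_input_files and not req.concurrent_updates_needed:
        candidates.append("multiload")

    # Tpump: one input file, target stays available for other updates.
    if not req.multiple_input_files:
        candidates.append("tpump")

    return candidates
```

For example, an incremental upsert into a table that users must keep querying leaves Multiload and Tpump as candidates, and the choice between them then depends on volume and how close to real time the maintenance must be.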

Informatica/Teradata Connections

The load method for an Informatica mapping is set on the mapping tab of the session, under TARGET. For Teradata load utilities, Writer is set to File Writer, Connection Type is set to Loader, and Value is set to the name of the connection. Connections are set up using the Connections tab in Workflow Designer.

The connection attributes are described first, followed by the settings used for each utility.

TDPID: Teradata server.
Database Name: Database containing the target table.
Date Format: Leave blank, assuming the value loaded into a date column in a target is a date/time type in Informatica.
Error Limit: Maximum number of rows that can be rejected before the job is aborted (0 = no limit).
Checkpoint: Number of rows (>= 60) or minutes (1-59) between checkpoints. If IS STAGED is selected, select a reasonable number of records or amount of time based on the size of the output file. If the connection is not staged, this should be set to 0 (no checkpoints).
Tenacity: Number of hours the job will keep trying to log on the required sessions.
Load Mode: Insert, Update, Delete, Upsert, or Data Driven. Data Driven uses the property set in the update strategy transformation in the mapping.
Drop Error Tables: Specifies whether or not to drop the error tables prior to starting the loader.
External Loader Executable: Name of the loader executable.
Max Sessions: Defaults to one per AMP.
Sleep: Number of minutes between logon tries.
Packing Factor: Number of statements to pack into a multi-statement request. Maximum is 600, default is 20.
Statement Rate: Maximum rate at which statements are sent to Teradata per minute. Unlimited if not specified.
Serialize: If set, actions to a given row are executed in order.
Robust: If off, simple restart logic is used (restart after last checkpoint).
No Monitor: If set, prevents Tpump from checking for statement rate changes to send to the monitor.
Truncate Target Table: If set, all rows in the target table are deleted prior to the load job starting.
Is Staged: Data is written to a flat file before the load job starts.
Error Database: Database where error tables will be created.
Work Table Database: Database where work tables will be created.
Log Table Database: Database where the log table will be created.

Attribute                   Fastload              Multiload                       Tpump
TDPID                       varies (td0 for POC)  varies (td0 for POC)            varies (td0 for POC)
Database Name               varies                varies                          varies
Date Format                 N/A                   blank                           N/A
Error Limit                 0                     0                               0
Checkpoint                  0                     0 not staged, >=10,000 staged   0
Tenacity                    4                     4                               4
Load Mode                   N/A                   Upsert                          Upsert
Drop Error Tables           No                    No                              No
External Loader Executable  fastload              mload                           tpump
Max Sessions                80 for POC            80 for POC                      10 for POC
Sleep                       6                     6                               6
Packing Factor              N/A                   N/A                             1
Statement Rate              N/A                   N/A                             blank
Serialize                   N/A                   N/A                             On
Robust                      N/A                   N/A                             Off
No Monitor                  N/A                   N/A                             On
Truncate Target Table       Off                   Off                             Off
Is Staged                   Off                   Off                             Off
Error Database              Varies (dw_errlog)    Varies (dw_errlog)              Varies (dw_errlog)
Work Table Database         N/A                   Varies (dw_errlog)              N/A
Log Table Database          N/A                   Varies (dw_errlog)              N/A
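
The Checkpoint attribute carries two meanings in one number, which is easy to misread. The sketch below only restates the rule given in the attribute description (0 means no checkpoints, 1-59 means minutes, 60 or more means rows). The function names are hypothetical and nothing here calls Informatica or Teradata.

```python
def describe_checkpoint(value: int) -> str:
    """Interpret a Checkpoint setting as described for the loader connection."""
    if value < 0:
        raise ValueError("checkpoint must not be negative")
    if value == 0:
        return "no checkpoints"
    if value < 60:
        # 1-59 is read as a time interval in minutes
        return f"checkpoint every {value} minute(s)"
    # 60 or more is read as a row count
    return f"checkpoint every {value} row(s)"


def checkpoint_warnings(value: int, is_staged: bool) -> list:
    """Flag settings that conflict with the guidance for non-staged connections."""
    warnings = []
    if not is_staged and value != 0:
        warnings.append("connection is not staged: checkpoint should be 0")
    return warnings
```

A staged Multiload connection with a checkpoint of 10000 would therefore checkpoint every 10,000 rows, while the same value on a non-staged connection would be flagged.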

Staged vs. Not Staged

When a loader connection has IS STAGED selected, Informatica writes its output to a flat file on the Informatica server. Data is sent to the target database only after Informatica has finished creating the flat file. Informatica does not delete the flat file after the loader has completed.

If a loader connection is not staged, Informatica starts sending data to the target database using named pipes as soon as it has data to send. After job completion, there is no flat file.

Source disk space requirements and restartability requirements need to be considered when choosing which option to use.

Restarting Load Jobs

Multiload

Staged: If a job abends prior to the application phase, you can choose to restart the job or abandon it. If it is restarted, it will pick up after the last checkpoint. To abandon the job, execute a RELEASE MLOAD statement against the target table, and drop the error and log tables. If the job has entered the application phase, you either have to restart it, or drop the target table, recreate it, and restore the data from a backup.

Not Staged: If a job abends prior to the application phase, it can't be restarted. Since there isn't an input file, there's no way to guarantee that the input will match the original input, and data corruption can occur. If the job abends in the application phase, it must be restarted, or the target table must be dropped and recreated.

Fastload

The same considerations apply regarding staged and not staged input. It's usually easiest to drop and recreate the table and start from the top.

Tpump

Staged: Restart the Tpump job. It will use the error and log tables to determine where it left off.

Not Staged: The job can't be restarted.
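
The restart rules above can be summarized in a small helper. This is a sketch of the decision logic only: `can_restart` and `abandon_multiload_statements` are hypothetical names, the table names passed in are placeholders, and treating Fastload like Multiload is an assumption based on the statement that the same staged and not staged considerations apply.

```python
def can_restart(utility: str, is_staged: bool, in_application_phase: bool = False) -> bool:
    """Return True if the text above allows the abended job to be restarted."""
    utility = utility.lower()
    if utility not in ("fastload", "multiload", "tpump"):
        raise ValueError(f"unknown utility: {utility}")

    if utility == "tpump":
        # Staged Tpump jobs resume from the error and log tables;
        # non-staged Tpump jobs cannot be restarted.
        return is_staged

    # Multiload (and, by assumption, Fastload):
    # staged jobs can be restarted in either phase; non-staged jobs
    # can only be restarted once the application phase has begun.
    if is_staged:
        return True
    return in_application_phase


def abandon_multiload_statements(target_table, error_tables, log_table):
    """Build the statements for abandoning a Multiload job before the
    application phase: release the lock, then drop error and log tables.
    All table names are supplied by the caller; none are assumed here."""
    statements = [f"RELEASE MLOAD {target_table};"]
    for table in error_tables:
        statements.append(f"DROP TABLE {table};")
    statements.append(f"DROP TABLE {log_table};")
    return statements
```

The helper only produces the statements as text; running them, and confirming the actual error and log table names created by the loader connection, is left to the operator.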


Troubleshooting

When Informatica launches a Teradata load job, the session waits for a return code from the load job. If a zero return code is received, the session is reported as successful; a non-zero return code results in a failure. But a successful load job doesn't necessarily mean that all rows were loaded successfully. Some or all of the rows may have been rejected and sent to the error table. Alternatively, rows that were assumed to be inserts may actually have been applied as updates because of duplicate keys in the input data.

Following any load job, its log should be checked to determine the actual results of the job. The log files are written to the …/TgtFiles directory, with an extension of ‘ldrlog’. There are two areas to look at to find the relevant information. The number of inserts, updates, and deletes is reported in the application section of the log. Entries in the clean-up section report the number of rows sent to the error table(s).

The error tables are created in the database specified in the Informatica connection. They are dropped at the end of the job if they are empty, so the existence of an error table after a load job indicates that at least one row was rejected. Look at the rows in the error table to find the error code.

When a load job is running much more slowly than expected, it's a good idea to check the number of rows in the associated error tables. Rows are written one at a time into the error table, as opposed to the much faster writes to the target tables, so if all or most of the rows are being rejected, the writes to the error tables will slow down the load job. If the number of rejected rows is very high, you may want to abort the load job, fix the problem, and then rerun it. The most common causes of rejected rows are NOT NULL violations resulting from failed lookup transformations, and data conversion errors.
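
As a small aid to the log check described above, the following sketch lists loader logs by extension and classifies a job from values read out of the log by hand. It makes no assumption about the internal format of the log. The function names are hypothetical, and the directory passed in stands for wherever the TgtFiles directory lives on the Informatica server.

```python
from pathlib import Path


def find_loader_logs(target_dir):
    """Return the *.ldrlog files in target_dir, most recently modified first."""
    logs = Path(target_dir).glob("*.ldrlog")
    return sorted(logs, key=lambda p: p.stat().st_mtime, reverse=True)


def load_outcome(return_code, rejected_rows):
    """Classify a load job from its return code and the number of rows
    reported as sent to the error table(s)."""
    if return_code != 0:
        return "failed"
    if rejected_rows > 0:
        # A zero return code does not mean every row was loaded.
        return "succeeded with rejected rows"
    return "succeeded"
```

The point of the second function is the middle case: a session reported as successful can still have rejected rows, which is why the log and the error tables need to be reviewed after every load.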
