Informatica's Velocity Methodology

Error Handling Strategies - General

Challenge
The challenge is to accurately and efficiently load data into the target data architecture. This Best Practice describes various loading scenarios, the use of data profiles, an alternate method for identifying data errors, methods for handling data errors, and alternatives for addressing the most common types of problems. For the most part, these strategies are relevant whether your data integration project is loading an operational data structure (as with data migrations, consolidations, or loading various sorts of operational data stores) or loading a data warehousing structure.

Description
Regardless of the target data structure, your loading process must validate that the data conforms to known rules of the business. When the source system data does not meet these rules, the process needs to handle the exceptions in an appropriate manner. The business needs to be aware of the consequences of either permitting invalid data to enter the target or rejecting it until it is fixed. Both approaches present complex issues. The business must decide what is acceptable and prioritize two conflicting goals:

- the need for accurate information, and
- the ability to analyze or process the most complete information available, with the understanding that errors can exist.

Data Integration Process Validation
In general, there are three methods for handling data errors detected in the loading process: Reject All, Reject None, and Reject Critical.

Reject All. This is the simplest method to implement, since all errors are rejected from entering the target when they are detected. This provides a very reliable target that the users can count on as being correct, although it may not be complete. Both dimensional and factual data can be rejected when any errors are encountered, and reports indicate what the errors are and how they affect the completeness of the data.

Dimensional or master data errors can cause valid factual data to be rejected because a foreign key relationship cannot be created. These errors need to be fixed in the source systems and reloaded on a subsequent load. Once the corrected rows have been loaded, the factual data is reprocessed and loaded, assuming that all errors have been fixed. This delay may cause some user dissatisfaction, since users need to take into account that the data they are looking at may not be a complete picture of the operational systems until the errors are fixed. For an operational system, this delay may also affect downstream transactions.

The development effort required to fix a Reject All scenario is minimal, since the rejected data can be processed through existing mappings once it has been fixed. Minimal additional code may need to be written, since data enters the target only if it is correct and is then loaded into the data mart using the normal process.

Reject None. With Reject None, the complete set of data is loaded. This approach gives users a complete picture of the available data without having to consider data that was not available because it was rejected during the load process. The problem is that the data may not be complete or accurate, and it may not support correct transactions or aggregations. All of the target data structures may contain incorrect information that can lead to incorrect decisions or faulty transactions. Factual data can be allocated to dummy or incorrect dimension rows, resulting in grand total numbers that are correct but detail numbers that are incorrect. After the errors are corrected, reports may change, with detail information being redistributed along different hierarchies.

The development effort to fix this scenario is significant. After the data is fixed, a new loading process needs to correct all of the target data structures, which can be a time-consuming effort based on the delay between an error being detected and fixed. The development strategy may include removing information from the target, restoring backup tapes for each night's load, and reprocessing the data. Once the target is fixed, these changes need to be propagated to all downstream data structures or data marts.

Reject Critical. This method provides a balance between missing information and incorrect information. It involves examining each row of data and determining the particular data elements to be rejected. All changes that are valid are processed into the target to allow for the most complete picture, while rejected elements are reported as errors so that they can be fixed in the source systems and loaded on a subsequent run of the ETL process.

This approach requires categorizing the data in two ways: 1) as key elements or attributes, and 2) as inserts or updates. Key elements are required fields that maintain the data integrity of the target and allow hierarchies to be summarized at various levels in the organization. Attributes provide additional descriptive information per key element. Inserts are important for dimensions or master data because subsequent factual data may rely on the existence of the dimension data row in order to load properly. Updates do not affect data integrity as much, because the factual data can usually be loaded with the existing dimensional data unless the update is to a key element.

The development effort for this method is more extensive than Reject All, since it involves classifying fields as critical or non-critical and developing logic to update the target and flag the fields that are in error. The effort also incorporates some tasks from the Reject None approach, in that processes must be developed to fix incorrect data in the entire target data architecture.

Informatica generally recommends using the Reject Critical strategy to maintain the accuracy of the target. By providing the most fine-grained analysis of errors, this method allows the greatest amount of valid data to enter the target on each run of the ETL process, while at the same time screening out the unverifiable data fields. However, business management needs to understand that some information may be held out of the target, and that some of the information in the target data structures may be at least temporarily allocated to the wrong hierarchies.
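
The row-level decision at the heart of a Reject Critical strategy can be sketched as follows. This is a minimal illustration rather than Informatica mapping logic; the field classifications, validation rules, and routing labels are assumptions made for the example.

    # Sketch of a Reject Critical routing decision for one source row.  The field
    # classifications, validation rules, and routing labels are illustrative
    # assumptions, not part of the methodology itself.
    CRITICAL_FIELDS = {"location_id", "product_id"}      # key elements

    def validate(row, rules):
        """Return a dict of field -> error message for every rule that fails."""
        return {f: msg for f, (check, msg) in rules.items() if not check(row.get(f))}

    def route_row(row, rules):
        """Decide what to do with a row under the Reject Critical strategy."""
        errors = validate(row, rules)
        if not errors:
            return "load", row, {}
        if any(f in CRITICAL_FIELDS for f in errors):
            # A key element failed: hold the whole row out of the target.
            return "reject", row, errors
        # Only attributes failed: load the row and flag the bad fields for follow-up.
        cleaned = dict(row)
        for f in errors:
            cleaned[f] = None                             # or an agreed default value
        return "load_with_flags", cleaned, errors

    rules = {
        "location_id": (lambda v: v is not None, "missing key"),
        "color": (lambda v: v in {"Red", "Black", "Blue"}, "invalid color"),
    }
    print(route_row({"location_id": 5001, "color": "BRed"}, rules))

Rows routed to "reject" would feed the error reports described above, while flagged rows carry their error list forward so the bad fields can be corrected later.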

Handling Errors in Dimension Profiles

Profiles are tables used to track history changes to the source data. As the source systems change, profile records are created with date stamps that indicate when the change took place. This allows power users to review the target data using either current (As-Is) or past (As-Was) views of the data.

Problems occur when two fields change in the source system and one of those fields results in an error. The first value passes validation, which produces a new profile record, while the second value is rejected and is not included in the new profile. If a field value was invalid, the original field value is maintained. When this error is fixed, the fix should ideally be applied to the profile that already exists, but the logic needed to perform this UPDATE instead of an INSERT is complicated. If a third field is changed in the source before the error is fixed, the correction process is complicated further.

The following example represents three field values in a source system. The first row, on 1/1/2000, shows the original values. On 1/5/2000, Field 1 changes from Closed Sunday to Open Sunday, and Field 2 changes from Black to BRed, which is invalid. On 1/10/2000, Field 3 changes from Open 9-5 to Open 24hrs, but Field 2 is still invalid. On 1/15/2000, Field 2 is finally fixed to Red.

Date        Field 1 Value    Field 2 Value   Field 3 Value
1/1/2000    Closed Sunday    Black           Open 9-5
1/5/2000    Open Sunday      BRed            Open 9-5
1/10/2000   Open Sunday      BRed            Open 24hrs
1/15/2000   Open Sunday      Red             Open 24hrs

Three methods exist for handling the creation and update of profiles:

1. The first method produces a new profile record each time a change is detected in the source. A profile record is created for each change in the source data; if a field value is invalid, the original field value is maintained.

Date        Profile Date   Field 1 Value   Field 2 Value   Field 3 Value
1/1/2000    1/1/2000       Closed Sunday   Black           Open 9-5
1/5/2000    1/5/2000       Open Sunday     Black           Open 9-5
1/10/2000   1/10/2000      Open Sunday     Black           Open 24hrs
1/15/2000   1/15/2000      Open Sunday     Red             Open 24hrs

By applying all corrections as new profiles in this method, we simplify the process by applying all changes in the source system directly to the target. Each change - regardless of whether it is a fix to a previous error - is applied as a new change that creates a new profile. However, this incorrectly shows in the target that two changes occurred to the source information when, in reality, a mistake was entered on the first change and should be reflected in the first profile; the second profile should not have been created.

2. The second method updates the first profile created on 1/5/2000 until all fields are corrected on 1/15/2000, which incorrectly reflects the changes in the source system. If the third field changes before the second field is fixed, as in this option, it is also added to the existing profile, which loses the profile record for the change to Field 3. When the second field is fixed, we show the third field as having changed at the same time as the first, when in reality a new profile record should have been entered for the 1/10/2000 change.

3. The third method creates only two new profiles, but then causes an update to the profile records on 1/15/2000 to fix the Field 2 value in both.

Date        Profile Date         Field 1 Value   Field 2 Value   Field 3 Value
1/1/2000    1/1/2000             Closed Sunday   Black           Open 9-5
1/5/2000    1/5/2000             Open Sunday     Black           Open 9-5
1/10/2000   1/10/2000            Open Sunday     Black           Open 24hrs
1/15/2000   1/5/2000 (Update)    Open Sunday     Red             Open 9-5
1/15/2000   1/10/2000 (Update)   Open Sunday     Red             Open 24hrs

If we try to apply changes to the existing profile, as in the second method, we run the risk of losing profile information. If we try to implement a method that updates old profiles when errors are fixed, causing an automated process to update old profile records as in the third method, we need to create complex algorithms that handle the process correctly. This involves being able to determine when an error occurred, examining all profiles generated since then, and updating them appropriately. And even if we create the algorithms to handle these methods, we still have the issue of determining whether a value is a correction or a new value: if an error is never fixed in the source system, but a new value is entered, we would incorrectly identify it as a fix to a previous error.
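
A minimal sketch of the first and third methods above, assuming profiles are held as a simple list of dictionaries; the field names and the validation rule are placeholders rather than part of the methodology.

    import datetime

    def is_valid(field, value):
        # Placeholder rule: "BRed" is not an acceptable value for field2.
        return not (field == "field2" and value == "BRed")

    def apply_change(profiles, change_date, new_values):
        """First method: every detected source change creates a new profile record.
        Rejected values are not written; the previous value is carried forward."""
        current = dict(profiles[-1]["values"]) if profiles else {}
        for field, value in new_values.items():
            if is_valid(field, value):
                current[field] = value
        profiles.append({"profile_date": change_date, "values": current})

    def apply_correction(profiles, field, corrected_value, error_first_seen):
        """Third method: a correction does not create a new profile; instead every
        profile written since the error first appeared is updated in place."""
        for p in profiles:
            if p["profile_date"] >= error_first_seen:
                p["values"][field] = corrected_value

    profiles = []
    apply_change(profiles, datetime.date(2000, 1, 1),
                 {"field1": "Closed Sunday", "field2": "Black", "field3": "Open 9-5"})
    apply_change(profiles, datetime.date(2000, 1, 5),
                 {"field1": "Open Sunday", "field2": "BRed"})    # BRed is rejected
    apply_change(profiles, datetime.date(2000, 1, 10), {"field3": "Open 24hrs"})

    # On 1/15 the bad value is finally fixed; under the third method this updates
    # the 1/5 and 1/10 profiles instead of creating a fourth profile record.
    apply_correction(profiles, "field2", "Red", datetime.date(2000, 1, 5))
    for p in profiles:
        print(p["profile_date"], p["values"])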

Recommended Method

A method exists to track old errors so that we know when a value was rejected. Then, when the process encounters a new, correct value, it flags it as part of the load strategy as a potential fix that should be applied to old Profile records. In this way, the corrected data enters the target as a new Profile record, but the process of fixing old Profile records, and potentially deleting the newly inserted record, is delayed until the data is examined and an action is decided. Once an action is decided, another process examines the existing Profile records and corrects them as necessary. This method only delays the As-Was analysis of the data until the correction method is determined, because the current information is already reflected in the new Profile.
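
One way to read this method is sketched below, assuming a simple error-tracking structure; the table names, fields, and review workflow are illustrative assumptions rather than a prescribed design.

    import datetime

    # rejected_log stands in for the error-tracking table the text describes: it
    # records when each field's value was first rejected.  All names are illustrative.
    rejected_log = {"field2": datetime.date(2000, 1, 5)}

    def load_change(profiles, pending_fixes, change_date, field, value):
        """The corrected value enters the target as a new profile immediately, but
        it is only flagged as a potential fix; old profiles are not yet touched."""
        current = dict(profiles[-1]["values"]) if profiles else {}
        current[field] = value
        profiles.append({"profile_date": change_date, "values": current})
        if field in rejected_log and change_date > rejected_log[field]:
            pending_fixes.append({"field": field, "value": value,
                                  "error_date": rejected_log[field],
                                  "new_profile_date": change_date})

    def apply_reviewed_fix(profiles, fix, delete_new_profile=False):
        """Run later, once an analyst has decided how the flagged fix applies."""
        for p in profiles:
            if fix["error_date"] <= p["profile_date"] < fix["new_profile_date"]:
                p["values"][fix["field"]] = fix["value"]
        if delete_new_profile:
            profiles[:] = [p for p in profiles
                           if p["profile_date"] != fix["new_profile_date"]]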

Data Quality Edits

Quality indicators can be used to record definitive statements regarding the quality of the data received and stored in the target. Quality indicators can be used to:

- show the record and field level quality associated with a given record at the time of extract;
- identify data sources and errors encountered in specific records; and
- support the resolution of specific record error types via an update and resubmission process.

Quality indicators can be used to record several types of errors, e.g., fatal errors (missing primary key value), missing data in a required field, wrong data type/format, or invalid data value. If a record contains even one error, data quality (DQ) fields are appended to the end of the record, one field for every field in the record. A data quality indicator code is included in the DQ fields corresponding to the original fields in the record where the errors were encountered. The indicators can be appended to existing data tables or stored in a separate table linked by the primary key.

The following types of errors cannot be processed:

- A source record does not contain a valid key. These records cannot be loaded to the target because they lack a primary key field to be used as a unique record identifier in the target. The record would be sent to a reject queue. Metadata will be saved and used to generate a notice to the sending system indicating that x number of invalid records were received and could not be processed. However, in the absence of a primary key, no tracking is possible to determine whether the invalid record has been replaced or not.
- The source file or record is illegible. The file or record would be sent to a reject queue. In this case, it is likely that individual unique records within the file are not identifiable. While information can be provided to the source system site indicating that there are file errors for x number of records, due to the nature of the error, no tracking is possible to determine whether the invalid records have been replaced or not. Metadata indicating that x number of invalid records were received and could not be processed may or may not be available for a general notice to be sent to the sending system.

In these error types, specific problems may not be identifiable on a record-by-record basis. Records containing a fatal error are stored in a Rejected Record Table and associated to the original file name and record number.

The following types of errors allow the records to be processed, but they contain errors:

- A required (non-key) field is missing.
- The value in a numeric or date field is non-numeric.
- The value in a field does not fall within the range of acceptable values identified for the field. Typically, a reference table is used for this validation.

When an error is detected during ingest and cleansing, the identified error type is recorded.

Quality Indicators (Quality Code Table)

The requirement to validate virtually every data element received from the source data systems mandates the development, implementation, capture, and maintenance of quality indicators. These are used to indicate the quality of incoming data at an elemental level. The quality indicators - 0-No Error, 1-Fatal Error, 2-Missing Data from a Required Field, 3-Wrong Data Type/Format, 4-Invalid Data Value, and 5-Outdated Reference Table in Use - apply a concise indication of the quality of the data within specific fields for every data type. These indicators provide the opportunity for operations staff, data quality analysts, and users to readily identify issues potentially impacting the quality of the data. At the same time, they provide the level of detail necessary for acute quality problems to be remedied in a timely manner. Aggregated and analyzed over time, these indicators provide the information necessary to identify acute data quality problems, systemic issues, business process problems, and information technology breakdowns.

Handling Data Errors

The need to periodically correct data in the target is inevitable. But how often should these corrections be performed? The correction process can be as simple as updating field information to reflect actual values, or as complex as deleting data from the target, restoring previous loads from tape, and then reloading the information correctly. Although we try to avoid performing a complete database restore and reload from a previous point in time, we cannot rule this out as a possible solution.

Reject Tables vs. Source System

As errors are encountered, they are written to a reject file so that business analysts can examine reports of the data and the related error messages indicating the causes of error. The business needs to decide whether analysts should be allowed to fix data in the reject tables, or whether data fixes will be restricted to the source systems. If errors are fixed in the reject tables, then these fixes must be applied correctly to the target data. If all fixes occur in the source systems, the target will not be synchronized with the source systems until the corrected data is reloaded. This can present credibility problems when trying to track the history of changes in the target data architecture.
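
The sketch below ties the quality code table and the reject-table decision together: one DQ indicator is produced per source field, fatal errors are routed to a rejected-record structure, and everything else is loaded with its flags. The codes follow the table above, while the field rules, key field, and table structures are assumptions for the sketch.

    # Illustrative only: the codes follow the quality code table above; the field
    # rules, key field, and table structures are assumptions for this sketch.
    NO_ERROR, FATAL, MISSING_REQUIRED, WRONG_TYPE, INVALID_VALUE, OUTDATED_REF = range(6)

    REQUIRED_FIELDS = {"order_date"}
    REFERENCE_VALUES = {"status": {"OPEN", "CLOSED"}}

    def quality_codes(record, key_field="customer_id"):
        """Return one DQ indicator per source field; a missing key is a fatal error."""
        dq = {}
        for field, value in record.items():
            if field == key_field and value in (None, ""):
                dq[field + "_dq"] = FATAL
            elif field in REQUIRED_FIELDS and value in (None, ""):
                dq[field + "_dq"] = MISSING_REQUIRED
            elif field in REFERENCE_VALUES and value not in REFERENCE_VALUES[field]:
                dq[field + "_dq"] = INVALID_VALUE
            else:
                dq[field + "_dq"] = NO_ERROR
        return dq

    def route(record, target, reject_table):
        """Load the record with its DQ flags, or reject it for analyst review."""
        dq = quality_codes(record)
        if FATAL in dq.values():
            reject_table.append({**record, **dq})    # Rejected Record Table
        else:
            target.append({**record, **dq})          # loaded, with errors flagged
        return dq

    target, rejects = [], []
    route({"customer_id": "C100", "order_date": "", "status": "PENDING"}, target, rejects)
    print(target, rejects)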

Attribute Errors and Default Values

Attributes provide additional descriptive information about a dimension concept. Attributes include things like the color of a product or the address of a store. Attribute errors are typically things like an invalid color or inappropriate characters in the address. These types of errors do not generally affect the aggregated facts and statistics in the target data; the attributes are most useful as qualifiers and filtering criteria for drilling into the data (e.g., to find specific patterns for market research).

Attribute errors can be fixed by waiting for the source system to be corrected and reapplying the data to the target. After a source system value is corrected and passes validation, it is corrected in the target.

When attribute errors are encountered for a new dimensional value, default values can be assigned to let the new record enter the target. Some rules that have been proposed for handling defaults are as follows:

Value Types        Description                                         Default
Reference Values   Attributes that are foreign keys to other tables    Unknown
Small Value Sets   Y/N indicator fields                                No
Other              Any other type of attribute                         Null or business-provided value

Reference tables are used to normalize the target model to prevent the duplication of data. When a source value does not translate into a reference table value, we use the Unknown value, which means undefined in the target. (All reference tables contain a value of Unknown for this purpose.)

Fields that are restricted to a limited domain of values (e.g., On/Off or Yes/No indicators) are referred to as small-value sets. When errors are encountered in translating these values, we use the value that represents Off or No as the default.

Other values, like numbers, are handled on a case-by-case basis. In many cases, the data integration process is set to populate Null into these fields. The business should provide default values for each identified attribute.
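
A minimal sketch of these default rules follows; the attribute classification and field names are assumptions made for illustration.

    # Minimal sketch of the proposed default rules.  The attribute classification
    # and field names are assumptions made for illustration.
    REFERENCE_ATTRS = {"color"}                  # foreign keys to reference tables
    SMALL_VALUE_SETS = {"open_sunday": "No"}     # Y/N indicators -> the "No" value

    def default_for(field):
        """Return the default to store when an attribute value fails validation."""
        if field in REFERENCE_ATTRS:
            return "UNKNOWN"                     # every reference table has an Unknown row
        if field in SMALL_VALUE_SETS:
            return SMALL_VALUE_SETS[field]       # use the value that means Off/No
        return None                              # any other attribute defaults to Null

    print(default_for("color"))        # -> UNKNOWN
    print(default_for("open_sunday"))  # -> No
    print(default_for("weight"))       # -> None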

Primary Key Errors

The business also needs to decide how to handle new dimensional values such as locations. For a genuinely new location, a location number is assigned and the new location is transferred to the target using the normal process. Problems occur when the new key is actually an update to an old key in the source system. For example, a new warehouse is assigned a location number and transferred to the target, and the location number is later changed due to some source business rule such as: all Warehouses should be in the 5000 range. The process assumes that the change in the primary key is actually a new warehouse and that the old warehouse was deleted. This type of error causes a separation of fact data, with some data being attributed to the old primary key and some to the new. An analyst would be unable to get a complete picture.

Fixing this type of error involves integrating the two records in the target data, along with the related facts. Integrating the two rows involves combining the profile information, taking care to coordinate the effective dates of the profiles so that they sequence properly. If two profile records exist for the same day, a manual decision is required as to which is correct. If facts were loaded using both primary keys, then the related fact rows must be added together and the originals deleted in order to correct the data.

The situation is more complicated when the opposite condition occurs (i.e., two primary keys mapped to the same target data ID really represent two different IDs). In this case, it is necessary to restore the source information for both dimensions and facts from the point in time at which the error was introduced, deleting affected records from the target and reloading from the restore to correct the errors.

DM Facts Calculated from EDW Dimensions

If information is captured as dimensional data from the source, but used as measures residing on the fact records in the target, we need to create processes that update those measures after the dimensional data is fixed.

Fact Errors

If there are no business rules that reject fact records except for relationship errors to dimensional data, then when we encounter errors that would cause a fact to be rejected, we save these rows to a reject table for reprocessing the following night. After the errors are fixed, the affected rows can simply be loaded and applied to the target data. This nightly reprocessing continues until the data successfully enters the target data structures. Initial and periodic analyses should be performed on the errors to determine why they are not being loaded.

From a data accuracy view, we would like to reject the fact until the value is corrected. If we load the facts with the incorrect data and let them enter the downstream target structures, the process to fix the target can be time consuming and difficult to implement. If we reject the facts when these types of errors are encountered, the fix process becomes simpler.
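
The nightly reject-and-reprocess cycle for fact rows can be sketched as follows; the key names and table structures are illustrative assumptions.

    # Sketch of the nightly reject-and-reprocess cycle for fact rows.  The key
    # names and structures are illustrative assumptions.
    def load_facts(fact_rows, dimension_keys, target, reject_table):
        """Load facts whose dimension keys resolve; park the rest for the next run."""
        still_rejected = []
        for row in fact_rows:
            if row["location_id"] in dimension_keys:
                target.append(row)
            else:
                still_rejected.append(row)       # reprocessed the following night
        reject_table[:] = still_rejected

    target, rejects = [], [{"location_id": 5001, "amount": 10.0}]
    dimension_keys = set()                       # the dimension row has not loaded yet
    load_facts(list(rejects), dimension_keys, target, rejects)
    print(rejects)                               # still waiting on the dimension fix
    dimension_keys.add(5001)                     # the error is fixed in the source
    load_facts(list(rejects), dimension_keys, target, rejects)
    print(target, rejects)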

Reference Tables

The target data architecture may use reference tables to maintain consistent descriptions. Each table contains a short code value as a primary key and a long description for reporting purposes. A translation table is associated with each reference table to map the codes to the source system values. Using both of these tables, the ETL process can load data from the source systems into the target structures regardless of how each source system stores the data. Reference data and translation tables thus enable the target data architecture to maintain consistent descriptions across multiple source systems.

Data Stewards

Data Stewards are generally responsible for maintaining reference tables and translation tables, creating new entities in dimensional data, and designating one primary data source when multiple sources exist. New entities in dimensional data include new locations, products, hierarchies, etc. Multiple source data occurs when two source systems can contain different data for the same dimensional entity.

The translation tables contain one or more rows for each source value and map the value to a matching row in the reference table. For example, the SOURCE column in FILE X on System X can contain O, S or W. The data steward would be responsible for entering the following values in the translation table:

Source Value   Code Translation
O              OFFICE
S              STORE
W              WAREHSE

These values are used by the data integration process to correctly load the target. Other source systems that maintain a similar field may use a two-letter abbreviation like OF, ST and WH. The data steward would make the following entries into the translation table to maintain consistency across systems:

Source Value   Code Translation
OF             OFFICE
ST             STORE
WH             WAREHSE

The data stewards are also responsible for maintaining the reference table that translates the codes into descriptions. The ETL process uses the reference table to populate the following values into the target:

Code Translation   Code Description
OFFICE             Office
STORE              Retail Store
WAREHSE            Distribution Warehouse

Error handling is required when the data steward enters incorrect information for these mappings and needs to correct it after data has been loaded. Correcting the above example could be complex (e.g., if the data steward entered ST as translating to OFFICE by mistake). At a minimum, the only way to determine which rows should be changed is to restore and reload source data from the first time the mistake was entered. Processes should be built to handle these types of situations, including correction of the entire target data architecture. (Other similar translation issues may also exist.)

Dimensional Data

New entities in dimensional data present a more complex issue. New entities in the target may include Locations and Products. Dimensional data uses the same concept of translation as reference tables: translation tables map the source system value to the target value. For Location this is straightforward, but over time Products may have multiple source system values that map to the same product in the target, so Products serves as a good example for error handling.

There are two possible methods for loading new dimensional entities. Either require the data steward to enter the translation data before allowing the dimensional data into the target, or create the translation data through the ETL process and force the data steward to review it. The first option requires the data steward to create the translation for new entities, while the second lets the ETL process create the translation, but marks the record as Pending Verification until the data steward reviews it and changes the status to Verified before any facts that reference it can be loaded.
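
The second loading option can be sketched as follows; the table contents, generated code, and statuses are assumptions made for the example, not Informatica objects.

    # Sketch of the second loading option: the ETL process creates a translation
    # entry for an unseen source value but marks it Pending Verification.  Table
    # contents, generated codes, and statuses are illustrative assumptions.
    translation = {"O": "OFFICE", "S": "STORE", "W": "WAREHSE"}
    reference = {"OFFICE": "Office", "STORE": "Retail Store",
                 "WAREHSE": "Distribution Warehouse"}
    status = {code: "Verified" for code in reference}

    def resolve(source_value):
        """Map a source value to its code and description, creating a pending
        translation entry when the value has not been seen before."""
        code = translation.get(source_value)
        if code is None:
            code = "NEW_" + source_value
            translation[source_value] = code
            reference[code] = "Unknown"
            status[code] = "Pending Verification"   # data steward must review
        return code, reference[code], status[code]

    print(resolve("W"))   # -> ('WAREHSE', 'Distribution Warehouse', 'Verified')
    print(resolve("H"))   # a new value enters the target as Pending Verification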

This requires the data stewards to review the status of new values on a daily basis. A potential solution is to generate an email each night if there are any translation table entries pending verification; the data steward then opens a report that lists them. When a dimensional value is left as Pending Verification, however, facts that reference it may be rejected or allocated to dummy values, requiring manual intervention.

A problem specific to Product is that a record created as new may really be just a changed SKU number. This causes additional fact rows to be created, which produces an inaccurate view of the product when reporting. When this is fixed, the fact rows for the various SKU numbers need to be merged and the original rows deleted. Profiles would also have to be merged, including beginning and ending effective dates. These dates are useful for both profile and date event fixes.

The situation is more complicated when the opposite condition occurs (i.e., two products are mapped to the same target product, but really represent two different products). In this case, it is necessary to restore the source information for all loads since the error was introduced. Affected records from the target should be deleted and then reloaded from the restore to correctly split the data: facts should be split to allocate the information correctly, and dimensions split to generate correct profile information.
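
For the duplicate-SKU case described above, the merge step can be sketched as follows; the row layout and the choice of surviving key are illustrative assumptions.

    # Sketch of merging fact rows once two SKUs are confirmed to be the same
    # product.  The row layout and surviving-key choice are illustrative.
    from collections import defaultdict

    def merge_duplicate_skus(fact_rows, duplicate_sku, surviving_sku):
        """Re-key facts from the duplicate SKU, then add rows together per period.
        The merged rows replace the originals, which are deleted."""
        totals = defaultdict(float)
        for row in fact_rows:
            sku = surviving_sku if row["sku"] == duplicate_sku else row["sku"]
            totals[(sku, row["period"])] += row["amount"]
        return [{"sku": s, "period": p, "amount": a} for (s, p), a in totals.items()]

    facts = [{"sku": "A100", "period": "2000-01", "amount": 40.0},
             {"sku": "A100-NEW", "period": "2000-01", "amount": 25.0}]
    print(merge_duplicate_skus(facts, "A100-NEW", "A100"))   # one merged row of 65.0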

Manual Updates

Over time, any system is likely to encounter errors that are not correctable using the source systems. A method needs to be established for manually entering fixed data and applying it correctly to the entire target data architecture. Further, a log of these fixes should be maintained so that the source of the fixes can be identified as manual rather than part of the normal load process.

Multiple Sources

The data stewards are also involved when multiple sources exist for the same data. This occurs when two sources contain subsets of the required information. For example, one system may contain Warehouse and Store information while another contains Store and Hub information. Because they share Store information, both sources have the ability to update the same row in the target. If both sources are allowed to update the shared information, data accuracy and profile problems are likely to occur. If we update the shared information on only one source system, the two systems then contain different information. If the changed system is loaded into the target, it creates a new profile indicating that the information changed. When the second system is loaded, it compares its old, unchanged value to the new profile, assumes a change occurred, and creates another new profile containing the old, unchanged value. If the two systems remain different, this process causes two profiles to be loaded every day until the two source systems are synchronized with the same information. When this happens, it is difficult to decide which source contains the correct information.

To avoid this type of situation, the business analysts and developers need to designate, at a field level, a primary source where information can be shared from multiple sources. Then, only if the field changes on the primary source would it be changed in the target. One solution to this problem is to develop a system of record for all sources. This allows developers to pull the information from the system of record, knowing that there are no conflicts for multiple sources. Another solution is to indicate, at the field level, the source that should be considered primary for the field. Developers can then use the field-level information to update only the fields that are marked as primary. However, this requires additional effort by the data stewards to mark the correct source fields as primary, and by the data integration team to customize the load process. While this sounds simple, it requires complex logic when creating Profiles, because multiple sources can provide information toward the one profile record created for that day.
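
A minimal sketch of the field-level primary-source approach follows; the source names and the ownership map are assumptions made for illustration.

    # Sketch of field-level primary-source designation for a row shared by two
    # systems.  The source names and the ownership map are illustrative assumptions.
    PRIMARY_SOURCE = {"address": "system_a", "store_format": "system_b"}

    def merge_shared_row(current, incoming, source_name):
        """Apply only the fields for which this source is designated primary."""
        merged = dict(current)
        for field, value in incoming.items():
            if PRIMARY_SOURCE.get(field) == source_name:
                merged[field] = value
        return merged

    store = {"address": "1 Main St", "store_format": "Mall"}
    store = merge_shared_row(store, {"address": "9 Elm St", "store_format": "Outlet"},
                             "system_b")
    print(store)   # only store_format changes; the address waits for system_a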
