ETL Design Questionnaire

Applies to:
Informatica PowerCenter

Summary
This article provides a questionnaire that can be useful when you are involved in the design process of an interface.

Author Bio
Author(s): Matthias Urech
Company: interface-development.com
Created on: January 8, 2009

Matthias Urech, founder and project leader of interface-development.com, has a proven track record of applying data integration solutions for several companies across the industry. He played a key role in implementing projects in the areas of data integration, data migration, data consolidation and data warehousing. Matthias contributed several articles about data integration best practices to the Informatica Technology Network, which led to his nomination as a member of the developer wall of fame and to an award as contributor of the year in 2008.

Informatica Technology Network © 2009 Informatica Corporation. All Rights Reserved.

http://technet.informatica.com

Table of Contents

Introduction
ETL Design Questionnaire
Question 1: Who is involved?
Question 2: What is the scope of the interface?
Question 3: How is the data food chain?
Question 4: How is the relationship between the systems?
Question 5: How is the data life cycle?
Question 6: How are changes captured?
Question 7: What are the load frequencies?
Question 8: What are the load types?
Question 9: Which data volume is extracted?
Question 10: What is the strategy to load data?
Question 11: What is the load order?
Question 12: How is the interface data flow?
Question 13: How is the target layout?
Question 14: How is the logical mapping?
Question 15: Where are data quality issues addressed?
Question 16: What are the operational actions?
Question 17: How is the scheduling configuration?
Question 18: What level of documentation is provided?
Disclaimer and Liability notice

Introduction

Do not design a bridge by counting the number of people who swim across the river today. That’s also true for ETL projects. Depending on your role in the ETL project, your work sometimes starts when the data flow needs to be built. Fine, someone will tell you what you have to do. But sometimes you will be involved earlier in the project to design the interface. You are then faced with the challenge of gathering requirements. That’s where the ETL Design Questionnaire comes into play. Asking the right questions is not only essential, but will also put you in a position to control the design process. As a side effect, you will be recognized as a professional ETL developer who has a plan.

It sounds as if the questionnaire is an exciting and useful tool to work with. But what makes the difference compared to other methodologies or frameworks (e.g. Informatica Velocity)? Each of the provided questions is supported by figures, tables or graphical elements (let’s call them diagrams). The main goal is to gather as much information as possible with a simple approach. It is neither about creating comprehensive documentation nor about strictly answering all questions. Consider this set of questions rather as a presentation of multiple views that gives all involved parties a common understanding of the interface. The key word here is “tailoring”. Decide for yourself what you want to use! At the end of the day, your goal should be to provide the requirements for developing a working interface, to respond to changes, and to maintain good customer communication and collaboration.

Just as everybody has a scheme for getting rich that will not work, the questions proposed in this article will not spare you from going through the design process, nor is the list of questions complete. I hope nonetheless that the ETL Design Questionnaire will be useful to you and to your challenges.

ETL Design Questionnaire

Question 1: Who is involved?

Basically, the ETL team is responsible for extracting, transforming and loading data. However, there are more tasks to think about within and outside the ETL team. In detail:

• Performing data analysis
• Defining the data quality strategy
• Gathering business rules
• Developing the interface
• Establishing test plans
• Performing reconciliation
• Executing tests
• Preparing deployment and supporting rollout
• Documenting the interface

These are just some of the tasks and the list is by far not complete. More specifically, all those tasks have to be done by someone. The objective of the role/task diagram is to define the involved people and their responsibilities. In short: you can simply ask "who does what?". For example: subject matter expert (who) provides business rules (what).

Figure 1: Role/Task Diagram

Question 2: What is the scope of the interface?

First, we want to understand the scope before building the interface. Consider the interface as a black box for the moment. In detail, you want to know which event provides input for the interface and what the output is. For example: once time and expense data has been posted after the month end (event), the interface reads the daily charged hours for each employee (input) and delivers aggregated hours for the financial period (output).

Figure 2: Interface Scope Diagram

Question 3: How is the data food chain?

Understanding the data food chain is important in order to get the big picture of the involved systems. This gives you an overview of the constraints and dependencies in the project as early as possible. The data food chain diagram should state the involved systems (box), the interfaces (arrow) and a description of the data types (arrow description).

Figure 3: Data Food Chain Diagram
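The data food chain of Question 3 can also be written down as a small data structure before it is drawn. The following is only a sketch: the system names and data descriptions are invented for illustration and do not come from this article.

```python
from collections import defaultdict

# Hypothetical interfaces: (source system, target system, description of the data).
INTERFACES = [
    ("HR", "Time Tracking", "employee master data"),
    ("Time Tracking", "Finance", "charged hours"),
    ("HR", "Finance", "cost center assignments"),
]


def build_food_chain(interfaces):
    """Return {system: [(target system, data description), ...]}.

    Systems are the boxes, list entries are the arrows with their descriptions.
    """
    chain = defaultdict(list)
    for source, target, data in interfaces:
        chain[source].append((target, data))
        chain.setdefault(target, [])  # make sure pure targets appear as boxes too
    return dict(chain)


def describe(chain):
    """Render one text line per arrow of the diagram."""
    return [
        f"{source} --[{data}]--> {target}"
        for source, arrows in chain.items()
        for target, data in arrows
    ]
```

Calling `describe(build_food_chain(INTERFACES))` yields one line per interface, which is often enough to review the chain with the involved parties before a diagram exists.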

Question 4: How is the relationship between the systems?

After drawing the data food chain you can easily focus on the systems that are involved in the ETL project. The goal is to identify the system relationship in order to find out which system owns the data. This will also help you to see whether you need to build additional interfaces or checks to ensure referential integrity.

Figure 4: System Relationships

Figure 4 illustrates the three types of system relationships:

• Master / Slave: This is the most common relationship. Data will be maintained in system 1 and provided to system 2.
• Master / Master (one direction): In this relationship, data will be maintained in both systems. Only system 1 will be able to update data in system 2.
• Master / Master (both directions): As already mentioned in the previous relationship, data will be maintained in both systems. This relationship shows that both systems are able to update data. Therefore, additional efforts (either manual or automatic) are needed to prevent data inconsistency and loss of data quality.

Of course, the road doesn’t end here, since some systems are connected to more than one other system (see data food chain diagram). In such a case you should prioritize the data flow order and check whether the data food chain makes sense at all.
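The three relationship types of Question 4 can be told apart by two facts: where the data is maintained and in which direction updates flow. The sketch below encodes just that. The function name and the way the facts are represented are assumptions made for illustration.

```python
def classify_relationship(maintained_in, update_flows):
    """Classify the relationship between system 1 and system 2.

    maintained_in: set of system numbers in which the data is maintained.
    update_flows:  set of (from_system, to_system) pairs that can update data.
    """
    if maintained_in == {1} and update_flows == {(1, 2)}:
        return "Master / Slave"
    if maintained_in == {1, 2} and update_flows == {(1, 2)}:
        return "Master / Master (one direction)"
    if maintained_in == {1, 2} and update_flows == {(1, 2), (2, 1)}:
        return "Master / Master (both directions)"
    # Anything else is worth a second look with the subject matter experts.
    return "unclassified"
```

An "unclassified" result is a hint that the data ownership has not been clarified yet.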

SYSDATE -1).g. Some systems only allow to flag data inactive instead of delete them. All Rights Reserved. it is the one • • • • Informatica Technology Network © 2009 Informatica Corporation.informatica. Timed Data Selection Timed data selection is when you check if the create or modified date fields are equal to a certain date (e.ETL Design Questionnaire Question 5: How is the data life cycle? By knowing the types of relationships. Interface changes are captures by retrieving a full copy of the target system in order to compare it with a full copy of the source system. There are several ways to capture changes: • Log Parsing Parsing logs is about scanning logs and capturing changes according to log entries. Current Data Selection In order to get only the current record is to select the end date field that contains NULL or a date in the future (e. While this is an effective way of doing it also contains some dangers since it’s the nature of logs to get full and do not log anything anymore. you will potentially loose all changes if the log is truncated before you run the interface. Audit Columns Most source systems have audit columns to store the date and time a record was added or modified. create) in system 1 will cause a certain action in system 2. http://technet. you have to think about providing additional data loads to deliver the full dataset in case the snapshot runs out of sync. you need to have a plan B to deliver the changes or a complete dataset.g. 12/31/9999). Changed Data Capture Changed data capture preserves that either the source system or the interface captures the changes. you are now able to draw the data flow in the data life cycle diagram. Moreover. Although this approach is not the most efficient technique. This approach works fine if the date in the audit columns is indicating only effective changes and the modified date does not overwrite the create date. 
What’s left is to move the data flow arrow horizontal to define at which point of time an action (i. Question 6: How are changes captured? The maintenance of data content is a key element in the development phase of an interface. Otherwise. This works fine if the interface is running only once per day and do never fail. Otherwise. For example: the data flow arrow will point from system 1 to system 2 in case of a Master/Slave relationship. There’s no problem as long as the target system detects the changes. The most common approach to detect changes is to compare the last modified date with the date when the interface ran the last time.e. This approach requires that you have a snapshot of data of the last interface run in order to compare and determine the changes.com 7 . Figure 5: Data Life Cycle Diagram Please note that the given actions in system 1 are just examples.

copy of previous extraction) will have the risk to miss some data. monthly. the frequency has an impact how the interface has to deal with the data at the point of execution. It’s important that you know all constraints and consider them during development.g. Question 8: What are the load types? The load type diagram outlines the load types and date ranges that are used in the interface. intra-daily or ad-hoc basis.informatica. Figure 6: Load Type Diagram Informatica Technology Network © 2009 Informatica Corporation. you should think through all possible cases and write all load frequencies down. Please consider that this list is by far not complete. weekly. Any other attempt (e. Instead. The frequency to load data can by on a yearly. Choosing the approach that best fit to the environment is key.ETL Design Questionnaire that is most reliable. quarterly. It’s essential that you only take an exact copy of the target system. All Rights Reserved.com 8 . http://technet. daily. Question 7: What are the load frequencies? Interfaces are scheduled to be executed on a certain frequency. Spending enough research time will prevent you from troubles running the interface in production. While scheduling itself is an operational task.
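Two of the change capture techniques from Question 6 are easy to illustrate in a few lines. The sketch below shows the audit column comparison and the comparison of a full target copy with a full source copy. The field name `modified_at` and both function names are assumptions chosen for the example, and the rows are held in memory, whereas a real interface would work on tables.

```python
from datetime import datetime


def changed_since(rows, last_run):
    """Audit columns: keep the rows modified after the last interface run."""
    return [row for row in rows if row["modified_at"] > last_run]


def compare_copies(target_copy, source_copy):
    """Full comparison: both arguments map a record key to the record's values.

    Returns the records to insert, to update and to delete in the target.
    """
    to_insert = {k: v for k, v in source_copy.items() if k not in target_copy}
    to_update = {
        k: v
        for k, v in source_copy.items()
        if k in target_copy and target_copy[k] != v
    }
    to_delete = {k: v for k, v in target_copy.items() if k not in source_copy}
    return to_insert, to_update, to_delete
```

The audit column variant only sees what the source system chose to stamp, while the full comparison also detects deletions, which is one reason it is considered the more reliable of the two.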

Question 9: Which data volume is extracted?

Knowing the extracted data volume is not only important for performance tuning but also essential for delivering exactly the required amount of data. The data volume contains one of the following datasets:

• Complete Dataset: All data is provided. This includes current and history data without applying any filter.
• History Dataset: Only history data is provided.
• Current Dataset: For each row only the most current data (known by source) is provided.
• Reduced Dataset: A filter is applied with one or more criteria (e.g. client number, report date, flag).

Determining the data volume may take some detective work. You might have to perform source system analysis, study the history concept of the source system or simply talk to subject matter experts in order to find out the criteria that are applied to the dataset.

Question 10: What is the strategy to load data?

Defining a load strategy is nothing other than building a relation between load frequency, load type and data volume. In other words: it’s like a cube that has three dimensions: frequency (load frequency), date (load type) and amount (data volume). Having a proper load strategy will help you get a clear understanding of what data the interface will Extract, Transform and Load (ETL). In addition, the load strategy table also contains the parameters that can be used to make the interface more flexible. This can reduce the overhead of creating multiple interfaces when only certain attributes of an interface need to be changed. Once you have defined the load strategy, it’s recommended to review each combination of load frequency, load type and data volume in order to identify gaps and to reduce duplicates and complexity.

Table 1: Load Strategy Table
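The review of the load strategy cube from Question 10 can be supported by a few lines of code. This is a minimal sketch: the frequencies and data volumes are taken from this article, while the load type names and the sample strategy are made up for illustration.

```python
from collections import Counter
from itertools import product

FREQUENCIES = ["daily", "monthly"]
LOAD_TYPES = ["full", "incremental"]   # hypothetical load type names
DATA_VOLUMES = ["complete", "current"]

# Hypothetical load strategy table: one tuple per defined combination.
STRATEGY = [
    ("daily", "incremental", "current"),
    ("monthly", "full", "complete"),
    ("daily", "incremental", "current"),  # defined twice by mistake
]


def review_strategy(strategy):
    """Return (gaps, duplicates) for the frequency x load type x volume cube."""
    counts = Counter(strategy)
    cube = product(FREQUENCIES, LOAD_TYPES, DATA_VOLUMES)
    gaps = [combination for combination in cube if combination not in counts]
    duplicates = sorted(c for c, n in counts.items() if n > 1)
    return gaps, duplicates
```

Not every gap is a problem, since many combinations are simply not needed. The point of the listing is that each gap was left open on purpose.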

Question 11: What is the load order?

Defining dependencies between jobs is very important, especially when you are loading data into a data warehouse or a target with enabled constraints. Without them, you can’t enforce referential integrity. Besides outlining dependencies, a load order table also shows jobs that are running in parallel.

Table 2: Load Order Table

Question 12: How is the interface data flow?

The interface data flow diagram is mostly used to outline the extract, transform and load (ETL) process. The goal is to have a common understanding of the data flow and of the applications and actions involved in delivering data between the systems.

Figure 7: Interface Data Flow Diagram
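A load order table as described in Question 11 can be derived from the job dependencies with a topological sort. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later) and groups the jobs into waves, where all jobs of one wave may run in parallel. The job names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs: each job maps to the set of jobs that must finish first.
DEPENDENCIES = {
    "load_customers": set(),
    "load_products": set(),
    "load_orders": {"load_customers", "load_products"},
    "load_order_items": {"load_orders", "load_products"},
}


def load_order(dependencies):
    """Return a list of waves; the jobs within one wave can run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies are circular
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

A circular dependency is reported as an error instead of a load order, which is a useful early warning that referential integrity cannot be enforced with the jobs as defined.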

Question 13: How is the target layout?

All that matters is the result of the solution or, in other words, what is loaded into the target. Therefore, the earlier you know what you have to provide, the earlier you can begin with the development.

Table 3: Target Definition Table

Question 14: How is the logical mapping?

We presume that source and target are known. The logical mapping table helps you to define the links between source and target fields and to document business rules. The logical mapping is like water: it’s easier to build something on it if the links and business rules are frozen.

Table 4: Logical Mapping Table
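A logical mapping table as described in Question 14 links each target field to a source field and a business rule. As a rough sketch, such a table can be expressed as data and applied to a source row. All field names and rules below are invented for illustration.

```python
# Hypothetical logical mapping: (target field, source field, business rule).
LOGICAL_MAPPING = [
    ("customer_id", "CUST_NO", int),          # convert the key to a number
    ("customer_name", "NAME", str.strip),     # remove surrounding blanks
    ("country_code", "CTRY", str.upper),      # country codes in upper case
]


def apply_mapping(source_row, mapping):
    """Build one target row by applying each business rule to its source field."""
    return {
        target_field: rule(source_row[source_field])
        for target_field, source_field, rule in mapping
    }
```

Keeping the mapping as a table rather than as scattered code makes it easier to review with subject matter experts and to see what changes when a rule is unfrozen.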

Question 15: Where are data quality issues addressed?

Here, the question is whether you should take care of data quality in the interface. You should assign as many data quality issues to the source as possible, since future interface development initiatives would otherwise have to deal with them again. However, some issues like incomplete data might be best addressed in the interface.

Table 5: Data Quality Assignment Table

Question 16: What are the operational actions?

Some operational actions are overlooked during development. As a result, you have to work on the interface again. Thinking about operational steps from the beginning will help you identify hidden requirements and make accurate effort estimates.

Table 6: Operational Action Table

Question 17: How is the scheduling configuration?

In most organizations operational tasks are not done by the developer. Therefore, the interface will be handed over and an operational handbook is provided. The operational handbook contains information about how to schedule the interface. Figure 8 displays the scheduling workflow with all its dependencies:

Figure 8: Scheduling Workflow Diagram

Moreover, a scheduling configuration table is also provided, explaining in detail how the scheduling job is set up and executed.

Table 7: Scheduling Configuration Table

Question 18: What level of documentation is provided?

Table 8 supports you in defining the documentation scope:

Table 8: Documentation Decision Table

or seek to hold. All Rights Reserved. Informatica responsible or liable with respect to the content of this software asset. Informatica Technology Network © 2009 Informatica Corporation. http://technet. You agree that you will not hold.ETL Design Questionnaire Disclaimer and Liability notice Informatica offers no guarantees and assumes no responsibility or liability of any type with respect to the content of this software asset.informatica.com 15 . including any liability resulting from incompatibility between the content within this asset and the materials and services offered by Informatica.
