
Documents needed for the review process

1) Design documents (SRS, SRDS)
2) Retirement plan for existing objects
3) Source system related details
4) Impact analysis document for existing objects, if any
5) Extraction frequency (daily, weekly or monthly)
6) Target system related details, if the target is outside the database
7) Creation of a new Informatica folder and Unix folders with the standard naming convention
8) Install guide (MD120)
9) ETL transformation specifications
10) Naming convention checks
11) Review of mappings and sessions (technical issues)
12) FMEA document
13) Cronacle documents and chains
14) Performance checks
15) Coordination with the development team to set up automated FTP for source files
16) Unit test plan
17) Migration of ETL objects into QA

Design documents

The design documents (SRS and SRDS) should be updated with all details present, all hyperlinks working, and any parts that are not applicable clearly highlighted. This saves time and effort when reviewing the entire document, since the SRS and SRDS are detailed documents that take considerable time and effort to review.

The SRDS document should furnish the following details:
1) Transformation Specifications
2) Capacity Plan
3) Test Plan
4) Impact Analysis Document
5) High Level Logical Data Model


Transformation Specifications

The transformation specifications should describe all the transformations used in the mappings, with the transformation logic explained at a high level. They should also state the source and target types, i.e. whether the source or target is a flat file or an RDBMS.

Capacity Plan

The database capacity plan should be highlighted.

Test Plan

The Unit Test Plan and the System Test Plan should be prepared, covering all test cases and stating whether the entire module passed all tests in the development environment. Any extraordinary errors encountered should be mentioned.

Informatica Project Review (M1, M2-M3, M4)

Milestone M1 check points:

1. The SRS and SRDS documents should be correct from a documentation perspective, e.g. the hyperlinks should work and points that are not applicable should be clearly indicated.
2. The SRDS should have details about the transformation specs, capacity plan, test plan, impact analysis document, high level logical model, etc.
3. Standard templates should be used for the transformation specification and the capacity plan.
4. If a new project is coming in (not an enhancement or remediation), check whether Oracle or Teradata is the target database. If they are going to load into Oracle, we need to ask why they are not going for Teradata.

Milestone M2-M3 check points:

1. Detailed transformation specifications with all details, such as the source or target system database/schema name, the transformation logic for each individual target column, etc.
2. For dos and don'ts, refer to the Informatica Developers Coding Standard, available in the BI Ops Manual.
3. One mapping, one session, one workflow. Reason: workflows are scheduled by Cronacle, and debugging of failures becomes easier.
4. In one mapping, ONLY one independent flow from source to target is allowed. More than one independent flow from source to target is NOT allowed; in such cases, create separate mappings, sessions and workflows.
5. In the workflow, the options "Fail parent if this task fails" and "Fail parent if this task does not run" should be checked in tasks/sessions.
6. Mention clearly in the MD120 if you are changing the default values of session properties such as commit interval, DTM buffer size, or Enable high precision.
7. Informatica mappings should be tuned properly. Unwanted transformations (especially Expression) should be avoided; refer to the developer coding guidelines.
8. In lookups, the "O" (output) option should be unchecked for unwanted ports (ports not used in lookup conditions or as output ports).
9. Hard coding in SQL overrides (in the Source Qualifier or in the Lookup SQL override) should be avoided or, where possible, made generic. E.g. fiscal month, year and date logic should be implemented dynamically instead of being hard coded, if possible (an illustrative parameter-file sketch appears after the M4 check points).
10. For the loading strategy, Teradata MLOAD or FASTLOAD should preferably be used rather than Teradata relational ODBC connections.
11. All log file and bad file paths should be the ones the production support team gives. The reason is the difference in root directory structure between the gemsdw1/gemsdw2 and gemsdw8p servers for the Informatica setup, and the default values set for the Informatica server variables.
12. E.g. for /ftp/ on gemsdw1 or gemsdw2, the relevant directory on gemsdw8p is dwftp/. GEMSDW8P is a secured server; being secured, no files can be FTPed to it from outside using normal FTP. So currently the data files come to the GEMSDW2 server, from which the Informatica 6.2 server (set up on GEMSDW8) can read the files. Caution: if you are manipulating the output flat files of Informatica mappings or data files, the scripts operating on these files should be on GEMSDW8P, and the data files should be FTPed to the relevant directory on GEMSDW8P to make them available to the scripts.
13. The MD120 is the baseline document for the BI Operations team from the migration perspective (across the dev/QA/preprod/prod environments). It should be 100% accurate and provide all the information needed for migration, e.g.: replace/reuse instructions for source/target definitions and mapplets, especially reusable components; the current values of sequence generators after moving into production; lookup transformation names and their location details (database user name@database name); source and target details if they are databases (db user name@database name for Oracle/Teradata), along with Teradata MLOAD or FASTLOAD information and whether insert/update/upsert or delete mode should be used for external loaders. If the source is flat files, clearly mention the paths and names of the flat files, and give clear instructions for changing the paths of input files listed in the indirect files used by the Informatica session, if required. For a change control meant for bug fixing or enhancement, mention the changes made per mapping, with the mapping name and the changes.
14. The MD120 should have the database details of the environment to which the code is going to be migrated (e.g. after M3, the Informatica code will be migrated to test, so the database details should be those of the test environment, especially for Teradata).
15. Avoid calling large, complicated stored procedures from Informatica mappings if the logic can be coded in the mappings.
16. For lookups that reside in the source or target database, use $Source or $Target instead of the source or target database ODBC connection value, and mention the connection details in the Properties tab of the session (if reusable; otherwise in the tasks).
17. Loader connections and ODBC connections should not be created by the program team in the development repository; the BI Operations team should be informed by entering an SPR.
18. From a performance tuning point of view: if the Source Qualifier is followed by an Aggregator, the sorted input option should be utilized if feasible. Lookups on the source database can be moved into the Source Qualifier SQL override, depending on the performance of the overridden query. Filters should be as close to the source as possible. If you are only updating or only inserting, use the "treat source rows as" update/insert option rather than an Update Strategy with rows treated as data driven. If you are using variables in expressions, the sequence of ports should be input, then variable, then output ports. If possible, use a SQL override in lookups with filter conditions, so that only the required data is brought from the database into the cache.
19. If the sources are files, the details of the source files should be provided in the MD120: which system they come from, the frequency, the contact person, etc.
20. As decided, migration from development to test and from test to preprod is done by changing the existing connection details to those of the required environment, e.g. INV_DEV_INS will be changed to INV_TEST_INS and will internally point to the test environment connections.
21. No pre-session or post-session commands are allowed; any such activity should be a step in the Cronacle job chain. This makes it easier to debug failures.
22. In the installation instructions, the repository names and folders should be clearly mentioned. If a particular object (mapping/workflow/session etc.) already exists, the MD120 should clearly state that a backup of the existing object needs to be taken prior to migration.
23. The session log, bad file and workflow log paths for a particular project should be used in development as follows (these need to be changed for all workflows):
Reject File Directory: $PMBadFileDir/<Application_folder>/
Workflow Log Directory: $PMWorkflowLogDir/<Application_folder>/
Session Log Directory: $PMSessionLogDir/<Application_folder>/
The following directory structure should be used where required:
Source File Directory: $PMSourceFileDir/<Application_folder>/
Output and Merge File Directory: $PMTargetFileDir/<Application_folder>/
Parameter File Directory: $PMSourceFileDir/<Application_folder>/scripts/
User-Created Scripts Directory: /ftp/scripts/<Application_folder>/
24. DB links are not allowed in Source Qualifier SQL overrides.
25. The tracing level for all transformations should be NORMAL.
26. If the source data has Chinese/Japanese characters, the code page for the Oracle database connections should be UTF-8.
27. If the source data has Unicode characters, it is mandatory to use the UTF-8 code page when creating Teradata relational connections. If MLOAD or FASTLOAD is being used, the output file properties should be set to the UTF-8 code page.
28. A step that generates the reject-records log and sends it by email should be added at the end of the Cronacle job chain (an illustrative script appears at the end of the Cronacle check lists below).
29. Source and target definition changes should be made by re-importing the definitions from the database instead of changing the definitions MANUALLY in Informatica Designer.

Milestone M4 check points:

1. The MD120 should have the database connection details of the Teradata preprod environment.
2. The folder to be migrated from the Americas Test Repository to the Americas QA Repository should be clean, i.e. it should contain only the mappings, sessions and workflows that need to be migrated.
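The following is a minimal sketch of the parameter-file approach referenced in M2-M3 check point 9: an early Cronacle step regenerates the session's Informatica parameter file so the mapping filters on mapping parameters instead of literal dates. The folder, workflow, session, parameter names and path are all invented for illustration, and the fiscal period is naively taken from the calendar date; a real project would apply its own fiscal calendar.

#!/bin/sh
# Illustrative sketch only: regenerate an Informatica parameter file so the
# mapping can filter on $$FISCAL_MONTH / $$FISCAL_YEAR instead of hard-coded
# dates. Folder, workflow, session and path names are invented examples.

PARAM_DIR=/ftp/scripts/abc                     # hypothetical project code "abc"
PARAM_FILE=$PARAM_DIR/src_did_sales_param.txt

# Naive period derivation; substitute the project's real fiscal calendar here.
FISCAL_MONTH=`date +%m`
FISCAL_YEAR=`date +%Y`

cat > "$PARAM_FILE" <<EOF
[ABC_SALES.WF:wkf_s_m_DL_SALES_FACT_INS_1.ST:s_m_DL_SALES_FACT_INS_1]
\$\$FISCAL_MONTH=$FISCAL_MONTH
\$\$FISCAL_YEAR=$FISCAL_YEAR
EOF

echo "Wrote $PARAM_FILE for period $FISCAL_YEAR-$FISCAL_MONTH"

The mapping's Source Qualifier or lookup override can then reference $$FISCAL_MONTH and $$FISCAL_YEAR, keeping the SQL generic across environments.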


Reference

1. http://gemsbidev.med.ge.com/opsmanual/
2. http://uswaubus02medge.med.ge.com/webhouse/reports/BI_Life_Cycle.htm

Informatica Standards

1. Naming Convention for Informatica Objects

Object Type: Naming Convention

- Folder: XXX_<Data Mart Name>
- Mapping: m_XY_<Target Table Name>_<OPR>_V, where XY = DL for Daily Load or ML for Monthly Load; <Target Table Name> = for Teradata the ETL view name (for Dim & Fact) or the stage table name, for Oracle the target table name; OPR = INS (Insert), DEL (Delete), UPD (Update) or UPS (Upsert); V = version number, e.g. 1, 2, 3 etc. It should not contain a dot (".").
- Session: s_<Mapping Name>[_optional session version]
- Workflow: wkf_<Session Name>. Note: if a temporary workflow is created for testing purposes, prefix the workflow name with the word TEST.
- Source Definition: <Source Table Name>. For Teradata: the ETL view name (for Dim & Fact) or the stage table name. For Oracle: the target table name.
- Target Definition: <Target Table Name>. For Teradata: the ETL view name (for Dim & Fact) or the stage table name. For Oracle: the target table name.
- Aggregator: AGG_<Purpose>
- Expression: EXP_<Purpose>
- Filter: FLT_<Purpose>
- Joiner: JNR_<Names of Joined Tables>
- Lookup: LKP_<Lookup Table Name>. For Teradata: the ETL view name (for Dim & Fact) or the stage table name. For Oracle: the target table name.
- Normalizer: NRM_<Source Name>
- Rank: RNK_<Purpose>
- Router: RTR_<Purpose>
- Sequence Generator: SEQ_<Target Column Name>
- Source Qualifier: SQ_<Source Table Name>
- Stored Procedure: STP_<Database Name>_<Procedure Name>
- Update Strategy: UPD_<Target Table Name>_xxx
- Mapplet: MPP_<Purpose>
- Input Transformation: INP_<Description of Data being funneled in>
- Output Transformation: OUT_<Description of Data being funneled out>
- Database Connections: XXX_<Database Name>_<Schema Name>
- ODBC Connection Name: <Schema_Name/Data_Base_Name>. For Oracle use the schema name, e.g. INDLOAD. For Teradata use the database name, e.g. SRC_ETL_TARGET. NOTE: Do not add any suffix or prefix to the ODBC connection name (e.g. SRC_ETL_TARGET_1 or SRC_ETL_TARGET_test). There should be only one ODBC connection for one database for the same project.
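As an invented example of how these conventions combine, consider a hypothetical daily insert load into a Teradata fact table whose ETL view is SALES_FACT, in a hypothetical folder ABC_SALES (all names below are illustrative, not prescribed):

Folder: ABC_SALES
Mapping: m_DL_SALES_FACT_INS_1
Session: s_m_DL_SALES_FACT_INS_1
Workflow: wkf_s_m_DL_SALES_FACT_INS_1
Source Qualifier: SQ_SALES_STG
Lookup: LKP_CUSTOMER_DIM
ODBC connection (Teradata): SRC_ETL_TARGET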

2. Naming Convention for Ports

Port Type: Naming Convention

- Input Only: I_<Field Name> (port name prefixed by I)
- Output Only: O_<Field Name> (port name prefixed by O)
- Input/Output: <Field Name>
- Variable Port: V_<Field Name> (port name prefixed by V)
- All Ports: port names should be in capitals
- Mapping Parameter: P_<Parameter Name> (port name prefixed by P)

3. Naming Convention for Parameters/Paths

Path/Parameter: Naming Convention

- Session Log: <Session Name>.log
- Workflow Log: <Workflow Name>.log
- Output File Name: <Target Table Name>.out, where <Target Table Name> is, for Teradata, the ETL view name (for Dim & Fact) or the stage table name, and for Oracle, the target table name. NOTE: the file name should not be more than 30 characters in length.
- Parameter File Name: src_did_<purpose>_param.txt. Note: a single parameter file for multiple sessions is preferred unless there is a specific requirement.
- Source File Directory: $PMSourceFileDir/xxx/, where xxx = <3-letter project code in lower case>
- Target File/Merge Directory: $PMTargetFileDir/xxx/, where xxx = <3-letter project code in lower case>
- Session Log File Directory: $PMSessionLogDir/xxx/, where xxx = <3-letter project code in lower case>
- Workflow Log File Directory: $PMWorkflowLogDir/xxx/, where xxx = <3-letter project code in lower case>
- Parameter File Directory: /ftp/scripts/xxx/, where xxx = <3-letter project code in lower case>
- Directory for User-Created Shell Scripts: /ftp/scripts/xxx/, where xxx = <3-letter project code in lower case>

4. Review & Migration Checklist:


4.1 Naming convention for Informatica objects (refer to the "Naming Convention for Informatica Objects" section in this document).

4.2 Naming convention for Informatica parameters/paths (refer to the "Naming Convention for Parameters/Paths" section in this document).

4.3 Mapping Level Check List

1) SQL Override: No SQL override should be used in the Source Qualifier unless it is needed to join multiple tables. It should not be overridden just to apply a filter; in that scenario, specify the filter in the filter section instead.
2) Hard-Coded DB Name: There should not be any hard-coded database name in any SQL override. This applies to SQL overrides in both the Source Qualifier and lookups.
3) Filter Usage: Use the filter in the Source Qualifier if possible, or as close to the Source Qualifier as possible.
4) Update Strategy Transformation: While loading to a Teradata target, use UPSERT/DELETE/UPDATE logic at the session level instead of an UPDATE STRATEGY transformation.
5) Unconnected Ports: No ports should be connected forward if they are not used by any transformation.
6) Connected vs. Unconnected Lookup: If the same result can be achieved with both a connected and an unconnected lookup, the connected lookup should be used.
7) Usage of Lookup: If a lookup can be avoided by folding it into the Source Qualifier, it is suggested to do so.
8) Caching: Caching should be enabled in lookups.
9) Joiner: A Joiner should not be used to join two tables from the same database.
10) Target Load Order: Multiple target load orders should not be used in a single mapping.
11) Filters Used for Testing: All filters or customizations added to the mapping for testing purposes have to be removed.
12) Sequence Generator: While migrating to QA/PREPROD (M3/M4 review), the initial value should be set to 1.

4.4 Session Level Checklist

4.4.1 Session Properties > General Tab:
1) Check the "Fail parent if this task fails" check box.
2) Check the "Fail parent if this task does not run" check box.
3) Put a one- or two-line description of the session in the description field.

4.4.2 Session Properties > Session Properties Tab:

4.4.2.1 General Options Sub Tab:
1) The session log file name should be as per the standard.
2) The session log directory name should be as per the standard.
3) The parameter file name should be empty if no parameter is used.
4) The "Enable test load" check box should be unchecked.
5) The $Source and $Target connection values should be provided. Exception: the source/target type is a flat file. (Note: the ODBC connection name should be mentioned in case an external loader connection is used.)
6) The rest of the options should be left at their defaults unless otherwise required.

4.4.2.2 Performance Sub Tab:
1) All options should be left at their defaults unless there is a specific requirement. Any change to a default option should be justified.

4.4.3 Session Properties > Config Object Tab:

4.4.3.1 Advanced Sub Tab:
1) All options should be left at their defaults unless there is a specific requirement. Any change to a default option should be justified.

4.4.3.2 Log Options Sub Tab:
1) Set the "Save session logs for this session" option to 1.

4.4.3.3 Error Handling Sub Tab:
1) All options should be left at their defaults unless there is a specific requirement. Any change to a default option should be justified.

4.4.4 Session Properties > Sources Tab:

4.4.4.1 Connections Sub Tab:
1) If the source file needs to be transferred from a different server, do not use an Informatica FTP connection. It should be done through a separate FTP script called by Cronacle (a sketch of such a script follows this session-level checklist).

4.4.5 Session Properties > Targets Tab:

4.4.5.1 Connections Sub Tab:
1) If the target file needs to be transferred to a different server, do not use an Informatica FTP connection. It should be done through a separate FTP script called by Cronacle.
2) The merge file name and directory should be empty if partitioning is not used.
3) The output file path should be set as described in the "Naming Convention for Parameters/Paths" section.

4.4.6 Session Properties > Components Tab:
1) Pre-Session Command, Post-Session Success Command and Post-Session Failure Command should be set to None. If any such activity is required, it should be done through a Cronacle step.
2) On Success E-mail and On Failure E-mail should be set to None. The notification method should be set up through the Cronacle chain.

4.4.7 Session Properties > Transformations Tab:
1) Lookup connections should be specified as indirect connections, i.e. $Source or $Target. A direct reference to a relational DB connection should not be specified.
2) SQL or filter conditions should not be overridden; they should be the same as in the mapping. Exception: multiple sessions are used for a single mapping, with different filter conditions or SQL overrides in different sessions. In such a scenario, the DB name should not be hard coded.
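For the two FTP points above (4.4.4.1 and 4.4.5.1), a minimal sketch of a standalone pull script that a Cronacle step would call is shown below. The host, user, file names and directories are invented, and the password is assumed to be supplied by the calling environment rather than stored in the script.

#!/bin/sh
# Illustrative sketch only: pull a source file via a standalone FTP script
# called from a Cronacle step, instead of an Informatica FTP connection.
# Host, user, paths and file names are invented examples.

REMOTE_HOST=src-host.example.com
REMOTE_USER=etluser
REMOTE_DIR=/outbound
FILE_NAME=sales_extract.dat
LOCAL_DIR=/ftp/abc                    # hypothetical source file directory

# FTP_PASSWORD is assumed to be provided by the calling environment.
ftp -n "$REMOTE_HOST" <<EOF
user $REMOTE_USER $FTP_PASSWORD
binary
cd $REMOTE_DIR
lcd $LOCAL_DIR
get $FILE_NAME
bye
EOF

# Fail the Cronacle step if the file did not arrive, so the chain stops here.
if [ ! -s "$LOCAL_DIR/$FILE_NAME" ]; then
    echo "ERROR: $FILE_NAME not received from $REMOTE_HOST" >&2
    exit 1
fi

Keeping the transfer in a script like this, rather than inside the session, lets the job chain report and retry the transfer as its own step.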


4.5 Checklist for MD120:
1) All mapping, session and workflow names that need to be migrated should be mentioned correctly.
2) All source and target connections should be mentioned correctly. If the source is a flat file, the path and file name should be mentioned. During the M3 migration, mention the DB connection or path with respect to the TEST/QA server; during the M4 review, do the same for the PREPROD server.
3) In case of loading to Teradata, the loader type (MLOAD/FLOAD) and the type of load (insert, update, delete, upsert or trunc-insert) should be clearly mentioned for each session.
4) Provide all lookup database connection information for each mapping.
5) If any CTL file has been created as READ-ONLY, provide clear instructions to migrate it into the respective environment and keep it READ-ONLY. Also provide the CTL file name, its current location, and instructions for changing the logon information, database name and source file path in the CTL file. Keep a backup of these READ-ONLY CTL files.
6) Provide a list of the source/target tables expected to be present before loading starts in the new environment.
7) Provide a list of the lookup tables expected to be present before loading starts in the new environment.
8) Provide the list of Unix scripts to be migrated, their locations, and the changes to be made to the scripts during migration.
9) Provide instructions to set the initial value and current value to 1 for all sequence generators.

Cronacle Standards and Check Lists

Source Files
All source files need to be defined with their server locations, file names, times of arrival and estimated sizes.

Time Windows
All scripts are to be defined with their scheduled start and scheduled finish times.

Project
Each script should be part of one and only one project. The project name should be defined using the following nomenclature: <Value Chain>_<Module>_<Program>_<Project>. All the sub-parts should be in sync with the End-to-End Inventory.

Notification
For each script, email IDs need to be specified for the people who will be notified in case of failure, success and overdue runs. Contact Details will be used only in case of emergencies and for error mails. The BI Mailing List and Func Mailing List will be used only for sending delay mails.

Dependencies
In case any job is dependent on flag files or job chains, this needs to be specified in the dependencies, along with the name of the job that creates the flag file.

Check Lists

Job Chain
1. The job chain name should follow the standards.
2. There should be one worksheet for each job chain, with the job chain name as the title of the worksheet.
3. All contact details are mandatory: email/distribution list as well as telephone number.
4. The login ID is mentioned.
5. The job chain has been uploaded at the path mentioned in CVS.
6. The job chain has been completely tested. This includes all scripts and dependencies.
7. All input files are validated by Screen Door.
8. Email lists have been provided.
9. All fields under Scheduling Information (marked in RED) are mandatory.
10. All fields are mandatory. Please mention "NA" wherever not applicable.

Dependencies
1. All events are mentioned and are part of the job chain script provided in CVS.
2. Are all source tables part of the dependent job chains?
3. In case of file events, the location of the source as well as the contact details are mandatory.
4. What will be the corrective action in case the input data file doesn't arrive?

Scripts
1. If project-specific scripts are used, the description should explain the purpose of creating such a script as opposed to using the generic script.

Time Window
1. Check whether any of the source tables are being loaded in the time frame when the job chain is expected to run.

Source/Lookup Tables/Database Objects/Flat Files
Are all source tables, lookup tables, database objects and dependent flag files mentioned?
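The M2-M3 check points call for a final Cronacle step that generates a reject-records log and emails it. The following is a minimal sketch under stated assumptions: the bad-file directory (the resolved value of $PMBadFileDir/<Application_folder>/), project code and recipient address are all invented.

#!/bin/sh
# Illustrative sketch only: final Cronacle job chain step that summarizes
# Informatica reject (.bad) files and mails the result.
# Directory, project code and recipient are invented examples.

BAD_DIR=/ftp/badfiles/abc            # hypothetical resolved bad-file directory
MAIL_TO=bi-ops-team@example.com
REPORT=/tmp/reject_report_$$.txt

# Count reject rows in each non-empty .bad file produced by the run.
{
  echo "Reject record summary for project abc - `date`"
  for f in "$BAD_DIR"/*.bad; do
      [ -s "$f" ] || continue       # skip empty or missing bad files
      n=`wc -l < "$f"`
      echo "`basename "$f"`: $n rejected rows"
  done
} > "$REPORT"

mailx -s "Reject records log - project abc" "$MAIL_TO" < "$REPORT"
rm -f "$REPORT"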

