Microsoft SQL Server Integration Services
Durable Impact Consulting, Inc.
Version 1.0 – Wes Dumey
Copyright 2006. Protected by the "Open Document License".
ETL Methodology Document

Document Licensing Standards

This document is protected by the copyright laws of the United States of America. In order to facilitate development of an open source ETL methodology document, users are permitted to modify this document at will, with the expressed understanding that any changes that improve upon or add to this methodology become property of the open community and must be forwarded back to the original author for inclusion in future releases of this document. This document or any portion thereof may not be sold or bartered for any form of compensation without the expressed written consent of the original author. By using this document you are agreeing to the terms listed above.
Overview

This document is designed for use by business associates and technical resources to better understand the process of building a data warehouse and the methodology employed to build the EDW. This methodology has been designed to provide the following benefits:

1. A high level of performance
2. Scalable to any size
3. Ease of maintenance
4. Boiler-plate development
5. Standard documentation techniques

ETL Definitions

Term                              Definition
ETL – Extract Transform Load      The physical process of extracting data from a source system, transforming the data to the desired state, and loading it into a database
EDW – Enterprise Data Warehouse   The logical data warehouse designed for enterprise information storage and reporting
DM – Data Mart                    A small subset of a data warehouse specifically defined for a subject area

Documentation Specifications

A primary driver of the entire process is accurate business information requirements. Durable Impact Consulting will use standard documents prepared by the Project Management Institute for requirements gathering, project signoff, and compiling all testing information.

ETL Naming Conventions

To maintain consistency, all ETL processes will follow a standard naming methodology.
Tables

All destination tables will utilize the following naming convention: EDW_<SUBJECT>_<TYPE>

There are six types of tables used in a data warehouse: Fact, Dimension, Aggregate, Staging, Temp, and Audit. Each type of table will be kept in a separate schema. This will decrease maintenance work and time spent looking for a specific table.

Fact – a table type that contains atomic data
Dimension – a table type that contains referential data needed by the fact tables
Aggregate – a table type used to aggregate data, forming a pre-computed answer to a business question (ex. totals by day)
Staging – tables used to store data during ETL processing, where the data is not removed immediately
Temp – tables used during ETL processing that can be truncated immediately afterwards (ex. storing order ids for lookup)
Audit – tables used to keep track of the ETL process (ex. processing times by job)

Sample names are listed below the quick overview of table types.

Table Name          Explanation
EDW_RX_FACT         Fact table containing RX subject matter
EDW_TIME_DIM        Dimension table containing TIME subject matter
EDW_CUSTOMER_AG     Aggregate table containing CUSTOMER subject matter
ETL_PROCESS_AUDIT   Audit table containing PROCESS data
STG_DI_CUSTOMER     Staging table sourced from the DI system, used for CUSTOMER data processing
ETL_ADDRESS_TEMP    Temp table used for ADDRESS processing

ETL Processing

The following types of ETL jobs will be used for processing. This table lists the job type and naming convention, and explains the job's function.

Job Type              Naming Convention                              Explanation
Extract               Extract<Source><Subject> (ExtractDICustomer)   Extracts information from a source system and places it in a staging table
LoadPSA               LoadPSA<Table>                                 Loads the persistent staging area
Source and LoadTemp   Source<Table> (SourceSTGDICustomer)            Sources information from STG tables, performs column validation, and loads the temp tables used in processing
LookupUnloadDimensions   LookupUnloadDimensions                         Lookup and unload dimension tables into flat files
LookupUnloadFacts        LookupUnloadFacts                              Lookup and unload fact tables into flat files
TransformFacts           TransformFacts                                 Transform the fact subject area data and generate insert files
TransformDimensions      TransformDimensions                            Transform the dimension subject area data and generate insert files
QualityCheck             QualityCheck<Subject> (QualityCheckCustomer)   Checks the quality of the data before it is loaded into the EDW
Aggregate                Aggregate                                      Aggregates data
Update Records           Update Records                                 Loads/inserts the changed records into the EDW

ETL Job Standards

All ETL jobs will be created with a boiler-plate approach. This approach allows for rapid creation of similar jobs while keeping maintenance low.

Comments

Every job will have a standard comment template that specifically spells out the following attributes of the job:

Job Name: LoadPSA
Purpose: Load the ETL_PSA_CUSTOMERS
Predecessor: Extract Customers
Date: July 10, 2007
Author: Wes Dumey
Revision History:
April 21, 2007 – Created the job from standard template
May 22, 2007 – Added new columns to PSA tables

In addition, there will also be a job data dictionary that describes every job in a table, such that it can be easily searched via standard SQL.
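The document does not specify the layout of the job data dictionary, only that every job should be described in a table searchable via standard SQL. A minimal sketch of that idea, with an assumed layout (the JOB_NAME, JOB_TYPE, PURPOSE, and PREDECESSOR columns are hypothetical), using SQLite purely for illustration:

```python
import sqlite3

# Hypothetical layout for the job data dictionary; the methodology only
# requires that jobs be describable in a table and searchable with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ETL_JOB_DICTIONARY (
        JOB_NAME    TEXT PRIMARY KEY,
        JOB_TYPE    TEXT,
        PURPOSE     TEXT,
        PREDECESSOR TEXT
    )
""")
jobs = [
    ("ExtractDICustomer",    "Extract",      "Extract customer data from the DI system", None),
    ("LoadPSACustomer",      "LoadPSA",      "Load the persistent staging area",         "ExtractDICustomer"),
    ("QualityCheckCustomer", "QualityCheck", "Validate customer data before EDW load",   "TransformFacts"),
]
conn.executemany("INSERT INTO ETL_JOB_DICTIONARY VALUES (?, ?, ?, ?)", jobs)

# Standard SQL search: find every job in the customer jobstream.
rows = conn.execute(
    "SELECT JOB_NAME FROM ETL_JOB_DICTIONARY"
    " WHERE JOB_NAME LIKE '%Customer' ORDER BY JOB_NAME"
).fetchall()
print([r[0] for r in rows])
```

Because the job names follow the boiler-plate naming convention, simple LIKE patterns are enough to pull back a whole subject area or a whole job type.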
Persistent Staging Areas

Data will be received from the source systems in its native format. The data will be stored in a PSA table following the naming standards listed previously. For each run of the process, a unique batch number composed of the time segments is created. This batch number is loaded with the data into the PSA and all target tables. The table will contain the following layout:

Column         Data Type   Explanation
ROW_NUMBER     NUMBER      Unique for each row in the PSA
DATE           DATE        Date row was placed in the PSA
STATUS_CODE    CHAR(1)     Indicates status of row ('I' inducted, 'P' processed, 'R' rejected)
ISSUE_CODE     NUMBER      Code uniquely identifying problems with data if STATUS_CODE = 'R'
BATCH_NUMBER   NUMBER      Batch number used to process the data (auditing)
Data columns to follow

Auditing

The ETL methodology maintains a process for providing audit and logging capabilities. For each run of the process, an entry with the following data elements will be made into the ETL_PROCESS_AUDIT table.

Column                 Data Type   Explanation
DATE                   DATE        (Index) run date
BATCH_NUMBER           NUMBER      Batch number of process
PROCESS_NAME           VARCHAR     Name of process that was executed
PROCESS_RUN_TIME       TIMESTAMP   Time (HH:MI:SS) of process execution
PROCESS_STATUS         CHAR        'S' SUCCESS, 'F' FAILURE
ISSUE_CODE             NUMBER      Code of issue related to process failure (if 'F')
RECORD_PROCESS_COUNT   NUMBER      Row count of records processed during run
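The document says only that the batch number is "composed of the time segments" of the run; the exact format is not fixed. One common choice, sketched here as an assumption, is to concatenate the run timestamp down to the second:

```python
from datetime import datetime

def make_batch_number(run_time: datetime) -> int:
    """Build a batch number from the time segments of the run
    (YYYYMMDDHHMISS). This assumes at most one run per second;
    a sequence suffix could be appended if runs start more often."""
    return int(run_time.strftime("%Y%m%d%H%M%S"))

# The same batch number is then stamped on every PSA and target row of the
# run, tying the loaded data back to its ETL_PROCESS_AUDIT entry.
batch = make_batch_number(datetime(2007, 7, 10, 14, 30, 5))
print(batch)  # 20070710143005
```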
The audit process will allow for efficient logging of process execution and encountered errors.

Quality

Due to the sensitive nature of data within the EDW, data quality is a driving priority. Quality will be handled through the following processes:

1. Source job – the source job will contain a quick data scrubbing mechanism that verifies the data conforms to the expected type (numeric is a number and character is a letter).
2. Transform – the transform job will contain matching metadata of the target table and verify that NULL values are not loaded into NOT NULL columns and that the data is transformed correctly.
3. QualityCheck – a separate job is created to do a cursory check on a few identified columns and verify that the correct data is loaded into these columns.

Source Quality

A data scrubbing mechanism will be constructed. This mechanism will check identified columns for any anomalies (ex. embedded carriage returns) and value domains. If an error is discovered, the data is fixed and a record is written in the ETL_QUALITY_ISSUES table (see below for table definition).

Transform Quality

The transformation job will employ a matching metadata technique. If the target table enforces NOT NULL constraints, a check will be built into the job preventing NULLs from being loaded and causing a jobstream abend.

Quality Check

Quality check is the last point of validation within the jobstream. QC will use a modified version of the data scrubbing engine used during the source job to derive correct values and reference rules listed in the ETL_QC_DRIVER table. QC can be configured to check any percentage of rows (0-100%) and any number of columns (1-X), and is designed to pay attention to the most valuable or vulnerable rows within the data sets. Any suspect rows will be pulled from the insert/update files and updated in the PSA table to an 'R' status, and an issue code will be created for the failure.
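The source-job type scrub and the configurable QC sampling described above can be illustrated with a short sketch. The rule format, function names, and sample data here are assumptions for illustration, not part of the methodology:

```python
def expected_type_ok(value: str, expected: str) -> bool:
    """Source-job scrub: numeric columns must parse as numbers,
    character columns must contain at least one letter."""
    if expected == "NUMERIC":
        return value.strip().lstrip("-").replace(".", "", 1).isdigit()
    return any(ch.isalpha() for ch in value)

def quality_check(rows, rules, percent=100):
    """QC pass: check the configured percentage of rows against the
    column rules; return (row_index, column, value) for each suspect cell."""
    limit = round(len(rows) * percent / 100)
    issues = []
    for i, row in enumerate(rows[:limit]):
        for column, expected in rules.items():
            if not expected_type_ok(row[column], expected):
                issues.append((i, column, row[column]))
    return issues

rows = [
    {"CUSTOMER_ID": "1001", "LAST_NAME": "Smith"},
    {"CUSTOMER_ID": "10x2", "LAST_NAME": "Jones"},  # bad numeric value
]
print(quality_check(rows, {"CUSTOMER_ID": "NUMERIC", "LAST_NAME": "CHAR"}))
```

A real QC driver would pull the rules and the sampling percentage from the ETL_QC_DRIVER table rather than hard-coding them; each suspect cell found here would become one ETL_QUALITY_ISSUES entry.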
Logging of Data Failures

Data that fails the QC job will not be loaded into the EDW, based on defined rules. An entry will be made into the following table (ETL_QUALITY_ISSUES). An indicator will show the value of the column as defined in the rules ('H' HIGH, 'L' LOW). This indicator will allow resources to be used efficiently to trace errors.

ETL_QUALITY_ISSUES
Column           Data Type   Explanation
DATE             DATE        Date of entry
BATCH_NUMBER     NUMBER      Batch number of process creating entry
PROCESS_NAME     VARCHAR     Name of process creating entry
COLUMN_NAME      VARCHAR     Name of column failing validation
COLUMN_VALUE     VARCHAR     Value of column failing validation
EXPECTED_VALUE   VARCHAR     Expected value of column failing validation
ISSUE_CODE       NUMBER      Issue code assigned to error
SEVERITY         CHAR        'H' HIGH, 'L' LOW

ETL_QUALITY_AUDIT
Column                 Data Type   Explanation
DATE                   DATE        Date of entry
BATCH_NUMBER           NUMBER      Batch number of process creating entry
PROCESS_NAME           VARCHAR     Name of process creating entry
RECORD_PROCESS_COUNT   NUMBER      Number of records processed
RECORD_COUNT_CHECKED   NUMBER      Number of records checked
PERCENTAGE_CHECKED     NUMBER      Percentage of checked records out of data set
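As a rough sketch of how one ETL_QUALITY_ISSUES entry might be assembled, with field names taken from the table definition above. In practice this would be a parameterized SQL insert; the plain dictionary, the helper name, and the example values are assumptions:

```python
from datetime import date

def quality_issue_entry(batch_number, process_name, column_name,
                        column_value, expected_value, issue_code, severity):
    """Build one ETL_QUALITY_ISSUES row. SEVERITY must be 'H' (high)
    or 'L' (low), as defined by the quality rules."""
    if severity not in ("H", "L"):
        raise ValueError("severity must be 'H' or 'L'")
    return {
        "DATE": date.today(),
        "BATCH_NUMBER": batch_number,
        "PROCESS_NAME": process_name,
        "COLUMN_NAME": column_name,
        "COLUMN_VALUE": column_value,
        "EXPECTED_VALUE": expected_value,
        "ISSUE_CODE": issue_code,
        "SEVERITY": severity,
    }

entry = quality_issue_entry(20070710143005, "QualityCheckCustomer",
                            "CUSTOMER_ID", "10x2", "numeric value",
                            101, "H")
print(entry["SEVERITY"])
```

Rejecting anything outside the defined severity domain at entry-creation time keeps the 'H'/'L' indicator trustworthy when support staff later triage errors by severity.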
ETL Job Templates

Extract

Source Combined with Load Temp
Lookup Dimension
Transform
Load

Closing

After reading this ETL document you should have a better understanding of the issues associated with ETL processing. This methodology has been created to address as many of those issues as possible while providing a high level of performance and ease of maintenance, and remaining scalable and workable in a real-time ETL processing scenario.