
Extract, Transform, Load (ETL)

Original slides were written by Torben Bach Pedersen

ETL Overview

General ETL issues
 ETL/DW refreshment process
 Building dimensions
 Building fact tables
 Extract
 Transformations/cleansing
 Load
MS Integration Services

The ETL Process

The most underestimated process in DW development
The most time-consuming process in DW development
 80% of development time is spent on ETL!
Extract
 Extract relevant data
Transform
 Transform data to DW format
 Build keys, etc.
 Cleansing of data
Load
 Load data into DW
 Build aggregates, etc.

Refreshment Workflow

[Figure: the DW refreshment workflow, with a preparation phase followed by an integration phase]
ETL In The Architecture

[Figure: ETL in the DW architecture. ETL side: operational source systems and the Data Staging Area, with extract, transform and load services. Query side: presentation servers - the Data Warehouse Bus with conformed dimensions and facts, data marts with aggregate-only data, data marts with atomic data - plus warehouse browsing, access and security, query management, standard reporting and activity monitor services, used by desktop data access tools, reporting tools and data mining. Metadata spans both sides.]

Data Staging Area (DSA)

Transit storage for data in the ETL process
 Transformations/cleansing done here
No user queries
Sequential operations on large data volumes
 Performed by central ETL logic
 No need for locking, logging, etc.
 RDBMS or flat files? (DBMSs have become better at this)
Finished dimensions copied from DSA to relevant marts
Allows centralized backup/recovery
 Often too time-consuming to do an initial load of all data marts after a failure
 Backup/recovery facilities needed
 Better to do this centrally in the DSA than in all data marts
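As a small illustration of the points above, a DSA staging table in an RDBMS can be a bare heap with no keys, constraints or indexes, since only the central ETL logic touches it sequentially. The table and column names below are invented for the sketch:

  -- Hypothetical DSA staging table: a plain heap, nothing to lock or maintain
  CREATE TABLE dsa_customer_extract (
      cust_id      VARCHAR(20),
      cust_name    VARCHAR(100),
      extracted_at DATETIME       -- when this row was extracted from the source
  );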

ETL Construction Process

Plan
 1) Make high-level diagram of source-destination flow
 2) Test, choose and implement ETL tool
 3) Outline complex transformations, key generation and job sequence for every destination table
Construction of dimensions
 4) Construct and test building static dimension
 5) Construct and test change mechanisms for one dimension
 6) Construct and test remaining dimension builds
Construction of fact tables and automation
 7) Construct and test initial fact table build
 8) Construct and test incremental update
 9) Construct and test aggregate build (you do this later)
 10) Design, construct, and test ETL automation

Building Dimensions

Static dimension table
 DW key assignment: production keys to DW keys using a table
 Combination of data sources: find common key?
 Check one-one and one-many relationships using sorting
Handling dimension changes
 Described in last lecture
 Find the newest DW key for a given production key
 Table for mapping production keys to DW keys must be updated
Load of dimensions
 Small dimensions: replace
 Large dimensions: load only changes
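To make the key handling concrete, here is a minimal T-SQL sketch of a table mapping production keys to surrogate DW keys; all names (product_map, src_product, dim_product) are invented for illustration:

  -- Mapping table: production key -> surrogate DW key
  CREATE TABLE product_map (
      prod_key INT IDENTITY(1,1),   -- surrogate DW key, assigned on insert
      prod_id  VARCHAR(20) NOT NULL -- production key from the source
  );

  -- Assign DW keys to production keys not seen before
  INSERT INTO product_map (prod_id)
  SELECT DISTINCT s.prod_id
  FROM src_product s
  WHERE NOT EXISTS (SELECT * FROM product_map m WHERE m.prod_id = s.prod_id);

  -- Small dimension, so simply replace it on each load
  TRUNCATE TABLE dim_product;
  INSERT INTO dim_product (prod_key, prod_id, name, category)
  SELECT m.prod_key, s.prod_id, s.name, s.category
  FROM src_product s
  JOIN product_map m ON m.prod_id = s.prod_id;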
Building Fact Tables

Two types of load
Initial load
 ETL for all data up till now
 Done when DW is started the first time
 Very heavy - large data volumes
Incremental update
 Move only changes since last load
 Done periodically (e.g., monthly or weekly) after DW start
 Less heavy - smaller data volumes
Dimensions must be updated before facts
 The relevant dimension rows for new facts must be in place
 Special key considerations if initial load must be performed again

Types of Data Sources

Non-cooperative sources
 Snapshot sources: provide only a full copy of the source, e.g., files
 Specific sources: each is different, e.g., legacy systems
 Logged sources: write a change log, e.g., DB log
 Queryable sources: provide a query interface, e.g., RDBMS
Cooperative sources
 Replicated sources: publish/subscribe mechanism
 Call-back sources: call external code (ETL) when changes occur
 Internal action sources: only internal actions when changes occur
   DB triggers are an example
Extract strategy depends on the source types

Extract

Goal: fast extract of relevant data
 Extract from source systems can take a long time
Types of extracts:
 Extract applications (SQL): co-existence with other applications
 DB unload tools: faster than SQL-based extracts
Extract applications are the only solution in some scenarios
Too time-consuming to ETL all data at each load
 Extraction can take days/weeks
 Drain on the operational systems
 Drain on DW systems
 => Extract/ETL only changes since last load (delta)

Computing Deltas

Delta = changes since last load
Store sorted total extracts in DSA
 Delta can easily be computed from current + last extract
 + Always possible
 + Handles deletions
 - High extraction time
Put update timestamp on all rows (in sources)
 Updated by DB trigger
 Extract only where timestamp > time for last extract
 + Reduces extract time
 - Cannot (alone) handle deletions
 - Source system must be changed, operational overhead

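Both delta strategies are easy to express in SQL. In this sketch, extract_current and extract_previous are hypothetical full snapshots kept in the DSA, src_orders is a hypothetical source table with a trigger-maintained updated_at column, and etl_log records the time of each load:

  -- Strategy 1: compare the current full extract with the previous one
  SELECT * FROM extract_current
  EXCEPT
  SELECT * FROM extract_previous;   -- new or changed rows

  SELECT * FROM extract_previous
  EXCEPT
  SELECT * FROM extract_current;    -- deleted rows (only this strategy finds them)

  -- Strategy 2: extract only rows stamped after the last load
  SELECT *
  FROM src_orders
  WHERE updated_at > (SELECT MAX(load_time) FROM etl_log);  -- misses deletions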
Changed Data Capture

Messages
 Applications insert messages in a queue at updates
 + Works for all types of updates and systems
 - Operational applications must be changed + operational overhead
DB triggers
 Triggers execute actions on INSERT/UPDATE/DELETE (see the sketch after the next slide)
 + Operational applications need not be changed
 + Enables real-time update of DW
 - Operational overhead
Replication based on DB log
 Find changes directly in the DB log, which is written anyway
 + Operational applications need not be changed
 + No operational overhead
 - Not possible in some DBMSs

Common Transformations

Data type conversions
 EBCDIC → ASCII/Unicode
 String manipulations
 Date/time format conversions
Normalization/denormalization
 To the desired DW format
 Depending on source format
Building keys
 Table matches production keys to surrogate DW keys
 Correct handling of history - especially for total reload

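The trigger-based capture from the Changed Data Capture slide might look roughly as follows in T-SQL. The customer and customer_changes tables are invented; inserted and deleted are the pseudo-tables SQL Server exposes inside triggers:

  -- Change table filled by the trigger, read later by the ETL
  CREATE TABLE customer_changes (
      cust_id    INT,
      change_op  CHAR(1),                    -- 'I', 'U' or 'D'
      changed_at DATETIME DEFAULT GETDATE()
  );
  GO
  CREATE TRIGGER trg_customer_cdc
  ON customer AFTER INSERT, UPDATE, DELETE
  AS
  BEGIN
      -- rows present after the statement: inserts and updates
      INSERT INTO customer_changes (cust_id, change_op)
      SELECT i.cust_id,
             CASE WHEN EXISTS (SELECT * FROM deleted) THEN 'U' ELSE 'I' END
      FROM inserted i;
      -- rows that disappeared: deletes
      INSERT INTO customer_changes (cust_id, change_op)
      SELECT d.cust_id, 'D'
      FROM deleted d
      WHERE NOT EXISTS (SELECT * FROM inserted i WHERE i.cust_id = d.cust_id);
  END;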

Data Quality

Data almost never has decent quality
Data in DW must be:
 Precise
   DW data must match known numbers - or an explanation is needed
 Complete
   DW has all relevant data and the users know it
 Consistent
   No contradictory data: aggregates fit with detail data
 Unique
   The same thing is called the same and has the same key (customers)
 Timely
   Data is updated frequently enough and the users know when

Cleansing

BI does not work on raw data
 Pre-processing necessary for BI analysis
Handle inconsistent data formats
 Spellings, codings, etc.
Remove unnecessary attributes
 Production keys, comments, etc.
Replace codes with text (Why?)
 City name instead of ZIP code, etc.
Combine data from multiple sources with common key
 E.g., customer data from customer address, customer name, etc.

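A hedged sketch of the last three cleansing steps above: drop attributes the DW does not need, replace the ZIP code with a city name, and combine two customer sources on their common key. All table names are invented:

  -- Combine name and address sources, replace code with text, and keep only
  -- the attributes the DW needs (comments, internal flags are not selected)
  SELECT n.cust_id,          -- production key, kept only for DW key lookup
         n.cust_name,
         a.street,
         z.city_name         -- text instead of the raw ZIP code
  FROM src_customer_name n
  JOIN src_customer_addr a ON a.cust_id = n.cust_id
  JOIN zip_to_city z       ON z.zip     = a.zip;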
Types Of Cleansing

Conversion and normalization
 Text coding, date formats, etc.
 Most common type of cleansing
Special-purpose cleansing
 Normalize spellings of names, addresses, etc.
 Remove duplicates, e.g., duplicate customers
Domain-independent cleansing
 Approximate, fuzzy joins on records from different sources
Rule-based cleansing
 User-specified rules, if-then style
 Automatic rules: use data mining to find patterns in data
   Guess missing sales person based on customer and item

Cleansing

Mark facts with a Data Status dimension
 Normal, abnormal, outside bounds, impossible, etc.
 Facts can be taken in/out of analyses
Uniform treatment of NULL
 Use explicit NULL value rather than special values (0, -1, etc.)
 Use NULLs only for measure values (estimates instead?)
 Use special dimension keys for NULL dimension values
   Avoids problems in joins, since NULL is not equal to NULL
Mark facts with changed status
 New customer, customer about to cancel contract, etc.

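One way to render the special-key rule in T-SQL: reserve a dimension row for missing values and fall back to its key when the production key cannot be resolved. The key value 0 and all names are invented:

  -- Reserved member for missing/unknown salespersons
  INSERT INTO dim_salesperson (sp_key, sp_id, sp_name)
  VALUES (0, 'N/A', 'Unknown');

  -- Facts with an unresolved salesperson still join to the dimension
  INSERT INTO fact_sales (sp_key, prod_key, amount)
  SELECT COALESCE(m.sp_key, 0),   -- special key instead of NULL
         p.prod_key,
         s.amount
  FROM src_sales s
  LEFT JOIN salesperson_map m ON m.sp_id   = s.sp_id
  JOIN product_map p          ON p.prod_id = s.prod_id;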

Improving Data Quality

Appoint a data quality administrator
 Responsibility for data quality
 Includes manual inspections and corrections!
Source-controlled improvements
 The optimal?
Construct programs that check data quality
 Are totals as expected?
 Do results agree with an alternative source?
 Number of NULL values?
Do not fix all problems with data quality
 Allow management to see weird data in their reports?
 Such data may be meaningful for them (e.g., fraud detection)

Load

Goal: fast loading into DW
 Loading deltas is much faster than total load
SQL-based update is slow
 Large overhead (optimization, locking, etc.) for every SQL call
 DB load tools are much faster
Index on tables slows load a lot
 Drop index and rebuild after load
 Can be done per index partition
Parallelization
 Dimensions can be loaded concurrently
 Fact tables can be loaded concurrently
 Partitions can be loaded concurrently
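A T-SQL sketch of the load tactics on this slide and the next: take the indexes out of the way, use the bulk loader on a pre-sorted file, then rebuild. Index, table and file names are placeholders:

  -- Indexes are not maintained row by row during the load
  ALTER INDEX ix_sales_prod ON fact_sales DISABLE;

  -- Loader-based load: ORDER says the file is already sorted on the
  -- clustering key, TABLOCK avoids per-row locking
  BULK INSERT fact_sales
  FROM 'C:\dsa\sales_delta.dat'
  WITH (FIELDTERMINATOR = ';',
        ORDER (sale_date_key ASC),
        TABLOCK);

  -- Rebuild once, after the load
  ALTER INDEX ix_sales_prod ON fact_sales REBUILD;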
Load

Relationships in the data
 Referential integrity and data consistency must be ensured (Why?)
 Can be done by the loader
Aggregates
 Can be built and loaded at the same time as the detail data
Load tuning
 Load without log
 Sort load file first
 Make only simple transformations in the loader
 Use loader facilities for building aggregates
Should the DW be on-line 24*7?
 Use partitions or several sets of tables (like MS Analysis)

ETL Tools

ETL tools from the big vendors
 Oracle Warehouse Builder
 IBM DB2 Warehouse Manager
 Microsoft Integration Services
Offer much functionality at a reasonable price
 Data modeling
 ETL code generation
 Scheduling DW jobs
The best tool does not exist
 Choose based on your own needs
 Check first if the standard tools from the big vendors are OK


Issues

Pipes
 Redirect output from one process to input of another process
   ls | grep 'a' | sort -r
Files versus streams/pipes
 Streams/pipes: no disk overhead, fast throughput
 Files: easier restart, often the only possibility
ETL tool or not
 Code: easy start, co-existence with IT infrastructure
 Tool: better productivity on subsequent projects
Load frequency
 ETL time depends on data volumes
 Daily load is much faster than monthly
 Applies to all steps in the ETL process

MS Integration Services

A concrete ETL tool
Example ETL flow
Demo
Integration Services (IS)

Microsoft's ETL tool
 Part of SQL Server 2005
Tools
 Import/export wizard - simple transformations
 BI Development Studio - advanced development
Functionality available in several ways
 Through GUI - basic functionality
 Programming - advanced functionality

Packages

The central concept in IS
Package for:
 Sources, connections
 Control flow
 Tasks, workflows
 Transformations
 Destinations



Package Control Flow

Containers provide
 Structure to packages
 Services to tasks
Control flow
 Foreach loop container
   Repeats tasks by using an enumerator
 For loop container
   Repeats tasks by testing a condition
 Sequence container
   Groups tasks and containers into control flows that are subsets of the package control flow
Task host container
 Provides services to a single task
Arrows: green (success), red (failure)

Tasks

Data Flow - runs data flows
Data Preparation Tasks
 File System - operations on files
 FTP - up/download data
Workflow Tasks
 Execute Package - execute other IS packages, good for structure!
 Execute Process - run external application/batch file
SQL Server Tasks
 Bulk Insert - fast load of data
 Execute SQL - execute any SQL query
Scripting Tasks
 Script - execute VB .NET code
Analysis Services Tasks
 Analysis Services Processing - process dims, cubes, models
 Analysis Services Execute DDL - create/drop/alter cubes, models
Maintenance Tasks - DB maintenance
Event Handlers

Executables (packages, containers) can raise events
Event handlers manage the events
Similar to event handlers in languages like Java and C#

Data Flow Elements

Sources
 Make external data available
 All ODBC/OLE DB data sources: RDBMS, Excel, text files, etc.
Transformations
 Update
 Summarize
 Cleanse
 Merge
 Distribute
Destinations
 Write data to a specific store
 Create in-memory data set
Input, Output, Error output


Transformations

Business intelligence transformations
 Term Extraction - extracts terms from text
 Term Lookup - looks up terms and finds term counts
Row transformations
 Character Map - applies string functions to character data
 Derived Column - populates columns using expressions
Rowset transformations (rowset = tabular data)
 Aggregate - performs aggregations
 Sort - sorts data
 Percentage Sampling - creates a sample data set by setting a %
Split and join transformations
 Conditional Split - routes data rows to different outputs
 Merge - merges two sorted data sets
 Lookup - looks up reference values by exact match
Other transformations
 Export Column - inserts data from a data flow into a file
 Import Column - reads data from a file and adds it to a data flow
 Slowly Changing Dimension - configures the update of an SCD

A Simple IS Case

Use BI Dev Studio/Import Wizard to copy TREO tables
Save in
 SQL Server
 File system
Look at package structure
 Available from mini-project web page
Look at package parts
 DROP, CREATE, source, transformation, destination
Execute package
 Error messages?
Steps execute in parallel
 But dependencies can be set up

ETL Demo

Load data into the Product dimension table
 Construct the DW key for the table by using IDENTITY
 Copy data to the Product dimension table
Load data into the Sales fact table
 Join raw sales table with other tables to get DW keys for each sales record
 Output of the query written into the fact table

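The two demo loads, sketched as T-SQL with invented table names: the IDENTITY column hands out the DW key for the dimension, and the fact load joins the raw sales rows to the dimensions to pick up DW keys:

  -- Dimension: DW key generated by IDENTITY
  CREATE TABLE dim_product (
      prod_key INT IDENTITY(1,1) PRIMARY KEY,  -- DW surrogate key
      prod_id  VARCHAR(20),                    -- production key
      name     VARCHAR(100)
  );
  INSERT INTO dim_product (prod_id, name)
  SELECT prod_id, name FROM raw_product;

  -- Facts: join raw sales to the dimensions to get DW keys
  INSERT INTO fact_sales (prod_key, date_key, amount)
  SELECT p.prod_key, d.date_key, r.amount
  FROM raw_sales r
  JOIN dim_product p ON p.prod_id  = r.prod_id
  JOIN dim_date    d ON d.cal_date = r.sale_date;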

ETL Part of Mini Project

Core:
 Build an ETL flow using MS DTS that can do an initial (first-time) load of the data warehouse
 Include logic for generating special DW surrogate integer keys for the tables
 Discuss and implement basic transformations/data cleansing
Extensions:
 Extend the ETL flow to handle incremental loads, i.e., updates to the DW, both for dimensions and facts
 Extend the DW design and the ETL logic to handle slowly changing dimensions of Type 2
 Implement more advanced transformations/data cleansing
 Perform error handling in the ETL flow

A Few Hints on ETL Design

Don't implement all transformations in one step!
 Build the first step and check that the result is as expected
 Add the second step and execute both, check the result
 Add the third step, etc.
Test SQL before putting it into IS
Do one thing at a time
 Copy source data one-to-one to DSA
 Compute deltas
   Only if doing incremental load
 Handle versions and DW keys
   Versions only if handling slowly changing dimensions
 Implement complex transformations
 Load dimensions
 Load facts

Summary

General ETL issues
 The ETL process
 Building dimensions
 Building fact tables
 Extract
 Transformations/cleansing
 Load
MS Integration Services

