Extract, Transform, Load

(ETL)

SAD 2007/08 H.Galhardas

Overview
• General ETL issues:
– Data staging area (DSA)
– Building dimensions
– Building fact tables
– Extract
– Load
– Transformation/cleaning
• Commercial (and open-source) tools
• The AJAX data cleaning and transformation
framework


DW Phases
Design phase
– Modeling, DB design, source selection,…
Loading phase
– First load/population of the DW
– Based on all data in sources
Refreshment phase
– Keep the DW up-to-date wrt. source data
changes


The ETL Process
• The most underestimated process in DW development

• The most time-consuming process in DW development
– Often, 80% of development time is spent on ETL

Extract
– Extract relevant data
Transform
– Transform data to DW format
– Build keys, etc.
– Cleansing of data
Load
– Load data into DW
– Build aggregates, etc.


DW Architecture
[Diagram: Data Sources (Operational DBs, other sources) feed Extract, Transform, Load, and Refresh into the Data Staging area and the Data Warehouse with Data Marts (Data Storage); a Monitor & Integrator maintains Metadata; an OLAP Server (OLAP Engine) serves the Front-End Tools: Analysis, Query/Reports, Data mining]

The ETL process
[Diagram: the Extract, Transform, Load, and Refresh steps of the same architecture]
Implemented using an ETL tool!
Ex: SQL Server 2005 Integration Services

Data Staging Area
• Transit storage for data underway in the ETL process
– Transformations/cleansing done here
• No user queries (some do it)
• Sequential operations (few) on large data volumes
– Performed by central ETL logic
– Easily restarted
– No need for locking, logging, etc.
– RDBMS or flat files? (DBMSs have become better at this)
• Finished dimensions copied from DSA to relevant marts

ETL construction process
Plan
1) Make high-level diagram of source-destination flow
2) Test, choose and implement ETL tool
3) Outline complex transformations, key generation and job sequence for every destination table
Construction of dimensions
4) Construct and test building static dimension
5) Construct and test change mechanisms for one dimension
6) Construct and test remaining dimension builds
Construction of fact tables and automation
7) Construct and test initial fact table build
8) Construct and test incremental update
9) Construct and test aggregate build
10) Design, construct, and test ETL automation

Building Dimensions
Static dimension table
– Assignment of keys: production keys to DW keys using a mapping table
– Combination of data sources: find common key?
Handling dimension changes
– Slowly changing dimensions
– Find newest DW key for a given production key
– Table for mapping production keys to DW keys must be updated
Load of dimensions
– Small dimensions: replace
– Large dimensions: load only changes

Building fact tables
Two types of load:
• Initial load
– ETL for all data up till now
– Done when DW is started the first time
– Often problematic to get correct historical data
– Very heavy: large data volumes
• Incremental update
– Move only changes since last load
– Done periodically (month/week/day/hour/…) after DW start
– Less heavy: smaller data volumes
• Dimensions must be updated before facts
– The relevant dimension rows for new facts must be in place
– Special key considerations if initial load must be performed again
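The production-key to DW-key mapping mentioned above can be sketched as follows. This is an illustrative sketch only, not code from the slides; all names (assign_surrogate_keys, key_map) are invented.

```python
# Sketch: assign DW surrogate keys to dimension rows, reusing the key
# already recorded in the production-key -> DW-key mapping table.

def assign_surrogate_keys(rows, key_map, next_key):
    """rows: (production_key, attributes) pairs; key_map: prod key -> DW key."""
    out = []
    for prod_key, attrs in rows:
        if prod_key not in key_map:          # first time we see this key
            key_map[prod_key] = next_key
            next_key += 1
        out.append((key_map[prod_key], prod_key, attrs))
    return out, next_key

key_map = {}
rows = [("P17", "Lisboa"), ("P42", "Porto"), ("P17", "Lisbon")]
dim_rows, next_key = assign_surrogate_keys(rows, key_map, next_key=1)
# "P17" gets the same surrogate key on both appearances
```

Keeping the mapping table persistent across loads is what makes re-running an initial load tricky, as the slide notes.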

Types of data sources
Non-cooperative sources
– Snapshot sources: provide only a full copy of the source
– Specific sources: each is different, e.g., legacy systems
– Logged sources: write a change log (DB log)
– Queryable sources: provide a query interface, e.g., SQL
Cooperative sources
– Replicated sources: publish/subscribe mechanism
– Call back sources: call external code (ETL) when changes occur
– Internal action sources: only internal actions when changes occur (DB triggers are an example)
Extract strategy is very dependent on the source types

Extract phase
Goal: fast extract of relevant data
– Extract from source systems can take a long time
Types of extracts:
– Extract applications (SQL): co-existence with other applications
– DB unload tools: much faster than SQL-based extracts
– Extract applications sometimes the only solution
Often too time consuming to ETL all data:
– Extracts can take days/weeks
– Drain on the operational systems
– Drain on DW systems
=> Extract/ETL only changes since last load (delta)
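When the source only delivers full snapshots, the delta can be obtained by diffing the current extract against the previous one. A minimal sketch (my own illustration, not the slides' code):

```python
# Sketch: derive the delta (inserts/updates/deletes) by comparing the
# current full extract with the previous one kept in the staging area.

def compute_delta(last, current):
    """last/current: dicts mapping a business key to the row contents."""
    inserts = {k: v for k, v in current.items() if k not in last}
    deletes = {k: v for k, v in last.items() if k not in current}
    updates = {k: v for k, v in current.items()
               if k in last and last[k] != v}
    return inserts, updates, deletes

last    = {1: "Ann", 2: "Bob", 3: "Carla"}
current = {1: "Ann", 2: "Robert", 4: "Dave"}
ins, upd, dele = compute_delta(last, current)
# deletions (key 3) are detected too -- the advantage of snapshot diffing
```

Note that this reduces load time but not extract time, exactly the trade-off the next slide discusses.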

Computing deltas
• Much faster to only "ETL" changes since last load
• A number of methods can be used
• Store sorted total extracts in DSA
– Delta can easily be computed from current + last extract
+ Always possible
+ Handles deletions
– Does not reduce extract time
• Put update timestamp on all rows
– Updated by DB trigger
– Extract only where "timestamp > time for last extract"
+ Reduces extract time
+/– Less operational overhead
– Cannot (alone) handle deletions
– Source system must be changed

Load (1)
Goal: fast loading into DW
– Loading deltas is much faster than total load
• SQL-based update is slow
– Large overhead (optimization, locking, etc.) for every SQL call
– DB load tools are much faster
– Some load tools can also perform UPDATEs
• Index on tables slows load a lot
– Drop index and rebuild after load
– Can be done per partition
• Parallelization
– Dimensions can be loaded concurrently
– Fact tables can be loaded concurrently
– Partitions can be loaded concurrently
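The load advice (batch the inserts, drop indexes before the load and rebuild them after) can be sketched with the stdlib sqlite3 module standing in for a warehouse bulk loader; table and index names are invented.

```python
# Sketch: batched load with drop-index/rebuild, using SQLite as a stand-in.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (day TEXT, amount REAL)")
con.execute("CREATE INDEX idx_day ON sales(day)")

rows = [("2007-01-01", 10.0), ("2007-01-01", 5.0), ("2007-01-02", 7.5)]

con.execute("DROP INDEX idx_day")                         # indexes slow the load
con.executemany("INSERT INTO sales VALUES (?, ?)", rows)  # one batched call
con.execute("CREATE INDEX idx_day ON sales(day)")         # rebuild after load
con.commit()

# aggregates can be built right after the detail load
total = con.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
```

In a real warehouse the same pattern is applied per partition, so index rebuilds and loads can run concurrently.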

Load (2)
• Relationships in the data
– Referential integrity must be ensured
– Can be done by loader
• Aggregates
– Must be built and loaded at the same time as the detail data
– Today, RDBMSs can often do this
• Load tuning
– Load without log
– Sort load file first
– Make only simple transformations in loader
– Use loader facilities for building aggregates
– Use loader within the same database
• Should DW be on-line 24*7?
– Use partitions or several sets of tables

Overview
• General ETL issues:
– Data staging area (DSA)
– Building dimensions
– Building fact tables
– Extract
– Load
– Transformation/cleaning
• Commercial (and open-source) tools
• The AJAX data cleaning and transformation framework

Cleaning and Transforming to get… High-quality data!

Why Data Cleaning and Transformation?
Data in the real world is dirty
incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
• e.g., occupation=""
noisy: containing errors or outliers (spelling, phonetic and typing errors, word transpositions, multiple values in a single free-form field)
• e.g., Salary="-10"
• e.g., Age="42", Birthday="03/07/1997"
inconsistent: containing discrepancies in codes or names (synonyms and nicknames, prefix and suffix variations, abbreviations, truncation and initials)
• e.g., was rating "1, 2, 3", now rating "A, B, C"
• e.g., discrepancy between duplicate records

Data Cleaning
Activity of converting source data into target data without errors, duplicates, and inconsistencies
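The three kinds of dirtiness (incomplete, noisy, inconsistent) can be illustrated with a toy record checker; all rules here are invented for illustration.

```python
# Sketch: flag incomplete, noisy, and inconsistent values in a toy record.

def check(record):
    problems = []
    if record.get("occupation", "") == "":
        problems.append("incomplete: missing occupation")
    if record.get("salary", 0) < 0:
        problems.append("noisy: negative salary")
    rating = record.get("rating")
    if rating is not None and rating not in {"A", "B", "C"}:
        problems.append("inconsistent: rating not in current A/B/C scheme")
    return problems

rec = {"occupation": "", "salary": -10, "rating": "1"}
issues = check(rec)
```

Real cleaning rules are of course domain-specific; the point is that each class of dirtiness needs its own detection logic.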

Why Is Data Dirty?
• Incomplete data comes from:
– non-available data value when collected
– different criteria between the time when the data was collected and when it is analyzed
– human/hardware/software problems
• Noisy data comes from:
– data collection: faulty instruments
– data entry: human or computer errors
– data transmission
• Inconsistent (and redundant) data comes from:
– Different data sources, so non-uniform naming conventions/data codes
– Functional dependency and/or referential integrity violation

Why Is Data Cleaning Important?
• Data warehouse needs consistent integration of quality data
– Data extraction, cleaning, and transformation comprises the majority of the work of building a data warehouse
• No quality data, no quality decisions!
– Quality decisions must be based on quality data (e.g., duplicate or missing data may cause incorrect or even misleading statistics)

Types of data cleaning
• Conversion, parsing and normalization
– Text coding, date formats, etc.
– Most common type of cleansing
• Special-purpose cleansing
– Normalize spellings of names, addresses, etc.
– Remove duplicates, e.g., duplicate customers
• Domain-independent cleansing
– Approximate, "fuzzy" joins on not-quite-matching keys
• Data stewards responsible for data quality
• DW-controlled improvement
• Source-controlled improvement
• Construct programs to check data quality

Data quality vs cleaning
• Data quality = Data cleaning +
– Data enrichment
• enhancing the value of internally held data by appending related attributes from external sources (for example, consumer demographic attributes or geographic descriptors)
– Data profiling
• analysis of data to capture statistics (metadata) that provide insight into the quality of the data and aid in the identification of data quality issues
– Data monitoring
• deployment of controls to ensure ongoing conformance of data to business rules that define data quality for the organization
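The "conversion, parsing and normalization" type of cleansing can be sketched for date formats; the format list is chosen for illustration only.

```python
# Sketch: normalize heterogeneous date strings to ISO format.
from datetime import datetime

FORMATS = ["%d/%m/%Y", "%Y-%m-%d", "%d %b %Y"]   # illustrative source formats

def normalize_date(raw):
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None  # leave unparseable values for manual inspection

dates = ["03/07/1997", "1997-07-03", "3 Jul 1997", "bad value"]
clean = [normalize_date(d) for d in dates]
```

Values that no format matches come back as None rather than being silently guessed; routing them to a reject file is the usual practice.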

Overview
• General ETL issues:
– Data staging area (DSA)
– Building dimensions
– Building fact tables
– Extract
– Load
– Transformation/cleaning
• Commercial (and open-source) tools
• The AJAX data cleaning and transformation framework

ETL tools
ETL tools from the big vendors, e.g.:
– Oracle Warehouse Builder
– IBM DB2 Warehouse Manager
– Microsoft Integration Services
Offer much functionality at a reasonable price (included…)
– Data modeling
– ETL code generation
– Scheduling DW jobs
– …
Many others
– Hundreds of tools
– Often specialized tools for certain jobs (insurance cleansing, …)
The "best" tool does not exist
– Choose based on your own needs
– Check first if the "standard tools" from the big vendors are ok

ETL and data quality tools
• http://www.etltool.com/
• Magic Quadrant for Data Quality Tools, 2007
• Some open source ETL tools:
– Talend
– Enhydra Octopus
– CloverETL
• Not so many open source quality/cleaning tools

Application context
• Integrate data from different sources
– E.g., populating a DW from different operational data stores
• Eliminate errors and duplicates within a single source
– E.g., duplicates in a file of customers
• Migrate data from a source schema into a different fixed target schema
– E.g., discontinued application packages
• Convert poorly structured data into structured data
– E.g., processing data collected from the Web

The AJAX data transformation and cleaning framework

Motivating example (1)
DirtyData(paper: String)
↓ Data Cleaning & Transformation
Publications(pubKey, title, url, eventKey, volume, number, pages, month, year)
Authors(authorKey, name)
Events(eventKey, name, city)
PubsAuthors(pubKey, authorKey)

Motivating example (2)
DirtyData:
[1] Dallan Quass, Ashish Gupta, Inderpal Singh Mumick, and Jennifer Widom. Making Views Self-Maintainable for Data Warehousing. In Proceedings of the Conference on Parallel and Distributed Information Systems, Miami Beach, Florida, USA, 1996.
[2] D. Quass, A. Gupta, I. Mumick, J. Widom. Making views self-maintainable for data warehousing. PDIS'95.
↓ Data Cleaning & Transformation
Authors: DQua | Dallan Quass; AGup | Ashish Gupta; JWid | Jennifer Widom; …
Publications: QGMW96 | Making Views Self-Maintainable for Data Warehousing | PDIS | null | null | … | 1996
Events: PDIS | Conference on Parallel and Distributed Information Systems | Miami Beach, Florida, USA
PubsAuthors: QGMW96 | DQua; QGMW96 | AGup; …

Modeling a data cleaning process
A data cleaning process is modeled by a directed acyclic graph of data transformations:
DirtyData → Formatting → Extraction → DirtyTitles, DirtyEvents, DirtyAuthors → Standardization (using the auxiliary Cities and Tags tables) → Duplicate Elimination → Authors, …
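The DAG-of-transformations idea can be sketched as a chain of steps applied to a record set. The step bodies here are trivial stand-ins, and the city-synonym table is invented; only the composition pattern is the point.

```python
# Sketch: a (linearized) cleaning pipeline as composable transformation steps.

def run_pipeline(steps, data):
    for step in steps:
        data = step(data)
    return data

def formatting(recs):
    return [" ".join(r.split()) for r in recs]        # collapse whitespace

def standardization(recs):
    city_syn = {"Miami Beach": "Miami Beach, FL"}      # made-up synonym table
    return [city_syn.get(r, r) for r in recs]

result = run_pipeline([formatting, standardization], ["Miami   Beach"])
```

A real engine keeps the steps as a DAG (one step can feed several others, as Extraction does above) rather than a linear list.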

Existing technology
• Ad-hoc programs written in a programming language like C or Java, or using an RDBMS proprietary language
– Programs difficult to optimize and maintain
• Data transformation scripts using an ETL (Extraction-Transformation-Loading) or a data quality tool

Problems of ETL and data quality solutions (1)
[Diagram: separate data cleaning transformations built per application domain (App. Domain 1, App. Domain 2, App. Domain 3)]
The semantics of some data transformations is defined in terms of their implementation algorithms

Problems of ETL and data quality solutions (2)
[Diagram: Dirty Data flows through the Cleaning process, producing Clean data and Rejected data]
There is a lack of interactive facilities to tune a data cleaning application program

AJAX features
• An extensible data quality framework
– Logical operators as extensions of relational algebra
– Physical execution algorithms
• A declarative language for logical operators
– SQL extension
• A debugger facility for tuning a data cleaning application program
– Based on a mechanism of exceptions

AJAX features
• An extensible data quality framework
– Logical operators as extensions of relational algebra
– Physical execution algorithms
• A declarative language for logical operators
– SQL extension
• A debugger facility for tuning a data cleaning application program
– Based on a mechanism of exceptions

Logical level: parametric operators
• View: arbitrary SQL query
• Map: iterator-based one-to-many mapping with arbitrary user-defined functions
• Match: iterator-based approximate join
• Cluster: uses an arbitrary clustering function
• Merge: extends SQL group-by with user-defined aggregate functions
• Apply: executes an arbitrary user-defined algorithm

Logical level
[Diagram: the cleaning DAG — DirtyData → Formatting → Extraction → DirtyTitles, DirtyAuthors, … → Standardization (Cities, Tags) → Duplicate Elimination → Authors — with each step realized by a logical operator: Formatting, Extraction and Standardization as Map; Duplicate Elimination as Match, Cluster, Merge]

Logical level vs. physical level
[Diagram: the same DAG annotated with physical choices — Formatting: Map (SQL, Scan); Extraction: Map (Java, Scan); Standardization: Map (Java, Scan); Duplicate Elimination: Match executed by nested loop (NL), Cluster by transitive closure (TC), Merge (Java, Scan)]

AJAX features
• An extensible data quality framework
– Logical operators as extensions of relational algebra
– Physical execution algorithms
• A declarative language for logical operators
– SQL extension
• A debugger facility for tuning a data cleaning application program
– Based on a mechanism of exceptions

Match
• Input: 2 relations
• Finds data records that correspond to the same real object
• Calls distance functions for comparing field values and computing the distance between input tuples
• Output: 1 relation containing matching tuples and possibly 1 or 2 relations containing non-matching tuples

Example
[Diagram: duplicate elimination as a graph — DirtyAuthors → Match → Cluster → Merge → Authors, with MatchAuthors as the Match output]

Example
CREATE MATCH MatchDirtyAuthors
FROM DirtyAuthors da1, DirtyAuthors da2
LET distance = editDistance(da1.name, da2.name)
WHERE distance < maxDist
INTO MatchAuthors
[Same graph: DirtyAuthors → Match → Cluster → Merge → Authors]

Example (cont.)
CREATE MATCH MatchDirtyAuthors
FROM DirtyAuthors da1, DirtyAuthors da2
LET distance = editDistance(da1.name, da2.name)
WHERE distance < maxDist
INTO MatchAuthors
Input: DirtyAuthors(authorKey, name)
861 | johann christoph freytag
822 | jc freytag
819 | j freytag
814 | j-c freytag
Output: MatchAuthors(authorKey1, authorKey2, name1, name2)
861 | 822 | johann christoph freytag | jc freytag
822 | 814 | jc freytag | j-c freytag

Implementation of the match operator
∀ s1 ∈ S1, s2 ∈ S2: (s1, s2) is a match if editDistance(s1, s2) < maxDist
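The match semantics above (compare every pair, keep those under the threshold) can be sketched in Python, taking editDistance to be the usual Levenshtein distance. This is an illustrative sketch of the semantics, not AJAX's implementation.

```python
# Sketch: naive pairwise match using Levenshtein edit distance.

def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def match(s1, s2, max_dist):
    return [(a, b) for a in s1 for b in s2
            if a != b and edit_distance(a, b) < max_dist]

names = ["jc freytag", "j-c freytag", "j freytag"]
pairs = match(names, names, max_dist=3)
```

Note the quadratic number of edit-distance calls; that cost is exactly what the next slides attack.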

A database solution
CREATE TABLE MatchAuthors AS
SELECT authorKey1, authorKey2, distance
FROM (SELECT a1.authorKey authorKey1, a2.authorKey authorKey2,
             editDistance(a1.name, a2.name) distance
      FROM DirtyAuthors a1, DirtyAuthors a2)
WHERE distance < maxDist;
• No optimization supported for a Cartesian product with external function calls
• Very expensive evaluation when handling large amounts of data
=> Need alternative execution algorithms for the same logical specification

Nested loop
[Diagram: every tuple of S1 compared with every tuple of S2 through editDistance]

Galhardas Window scanning S n SAD 2007/08 H.Galhardas . Window scanning S n SAD 2007/08 H.

Window scanning
• May lose some matches

String distance filtering
[Diagram: candidate matches for "John Smith" with maxDist = 1 — "John Smit" (length − 1), "Jogn Smith" (same length), "John Smithe" (length + 1); only strings whose length differs by at most maxDist are passed to editDistance]
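The filter can be sketched as a cheap pre-pass: since |len(a) − len(b)| is a lower bound on the edit distance, pairs whose lengths differ by more than the threshold are discarded without computing the distance at all.

```python
# Sketch: length-based filtering before any edit-distance computation.

def candidates(s1, s2, max_dist):
    return [(a, b) for a in s1 for b in s2
            if abs(len(a) - len(b)) <= max_dist]

left = ["John Smith"]
right = ["John Smit", "Jogn Smith", "John Smithe", "Johannes Smitherson"]
pairs = candidates(left, right, max_dist=1)
# only the three near-length strings survive; the long name is pruned cheaply
```

Unlike window scanning, this filter is lossless: no pair under the threshold can be discarded, because the length difference never exceeds the true edit distance.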

Declarative specification
DEFINE FUNCTIONS AS
Choose.uniqueString(OBJECT[]) RETURN STRING THROWS CiteSeerException
Generate.generateId(INTEGER) RETURN STRING
Normal.removeCitationTags(STRING) RETURN STRING (600)
DEFINE ALGORITHMS AS
TransitiveClosure SourceClustering(STRING)
DEFINE INPUT DATA FLOWS AS
TABLE DirtyData (paper STRING (400))
TABLE City (city STRING (80), citysyn STRING (80)) KEY city, citysyn
DEFINE TRANSFORMATIONS AS
CREATE MAPPING mapKeDiDa
FROM DirtyData Dd
LET keyKdd = generateId(1)
{SELECT keyKdd AS paperKey, Dd.paper AS paper
KEY paperKey
CONSTRAINT NOT NULL mapKeDiDa.paper}

Annotation-based optimization
• The user specifies types of optimization
• The system suggests which algorithm to use
Ex:
CREATE MATCHING MatchDirtyAuthors
FROM DirtyAuthors da1, DirtyAuthors da2
LET dist = editDistance(da1.name, da2.name)
WHERE dist < maxDist
% distance-filtering: map = length, dist = abs %
INTO MatchAuthors

Galhardas Management of exceptions • Problem: to mark tuples not handled by the cleaning criteria of an operator • Solution: to specify the generation of exception tuples within a logical operator – exceptions are thrown by external functions – output constraints are violated SAD 2007/08 H. AJAX features • An extensible data quality framework – Logical operators as extensions of relational algebra – Physical execution algorithms • A declarative language for logical operators – SQL extension • A debugger facility for tuning a data cleaning program application – Based on a mechanism of exceptions SAD 2007/08 H.Galhardas .

Galhardas .Galhardas Architecture SAD 2007/08 H. in the future. Debugger facility • Supports the (backward and forward) data derivation of tuples wrt an operator to debug exceptions • Supports the interactive data modification and. the incremental execution of logical operators SAD 2007/08 H.

To see it working…
Workshop-UQ? Tomorrow at Tagus Park, 10H-16H