Etl PDF
Aalborg University 2007 - DWML course 5 Aalborg University 2007 - DWML course 6
Building Fact Tables
  Two types of load
    Initial load
      ETL for all data up till now
      Done when DW is started the first time
      Very heavy - large data volumes
    Incremental update
      Move only changes since last load
      Done periodically (e.g., month or week) after DW start
      Less heavy - smaller data volumes
  Dimensions must be updated before facts
    The relevant dimension rows for new facts must be in place
    Special key considerations if initial load must be performed again

Types of Data Sources
  Non-cooperative sources
    Snapshot sources: provide only a full copy of the source, e.g., files
    Specific sources: each is different, e.g., legacy systems
    Logged sources: write a change log, e.g., DB log
    Queryable sources: provide a query interface, e.g., RDBMS
  Cooperative sources
    Replicated sources: publish/subscribe mechanism
    Call back sources: call external code (ETL) when changes occur
    Internal action sources: only internal actions when changes occur
      DB triggers are an example
  Extract strategy depends on the source types
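
The load order from the Building Fact Tables slide (dimensions before facts, and only changes since the last load) can be sketched as follows. This is a minimal illustration, not the course's implementation: the tables dim_customer and fact_sales, the changed_at column, and the function incremental_load are all hypothetical names, and an in-memory SQLite database stands in for the DW.

```python
import sqlite3

def incremental_load(dw, source_rows, last_load):
    """Load only rows changed since last_load; dimensions first, then facts."""
    changed = [r for r in source_rows if r["changed_at"] > last_load]

    # Step 1: make sure every customer referenced by a new fact has a dimension row.
    for r in changed:
        dw.execute(
            "INSERT OR IGNORE INTO dim_customer (production_key, name) VALUES (?, ?)",
            (r["customer_id"], r["customer_name"]),
        )

    # Step 2: insert facts, looking up the surrogate key for each production key.
    for r in changed:
        (customer_key,) = dw.execute(
            "SELECT customer_key FROM dim_customer WHERE production_key = ?",
            (r["customer_id"],),
        ).fetchone()
        dw.execute(
            "INSERT INTO fact_sales (customer_key, amount) VALUES (?, ?)",
            (customer_key, r["amount"]),
        )
    dw.commit()
    # The new watermark is the latest change seen (unchanged if nothing was loaded).
    return max((r["changed_at"] for r in changed), default=last_load)


dw = sqlite3.connect(":memory:")
dw.executescript("""
    CREATE TABLE dim_customer (
        customer_key   INTEGER PRIMARY KEY,   -- surrogate DW key
        production_key TEXT UNIQUE,           -- key from the source system
        name           TEXT
    );
    CREATE TABLE fact_sales (
        customer_key INTEGER REFERENCES dim_customer (customer_key),
        amount       REAL
    );
""")

# Made-up source rows; ISO dates compare correctly as strings.
source = [
    {"customer_id": "C1", "customer_name": "Ann", "amount": 10.0, "changed_at": "2007-01-05"},
    {"customer_id": "C2", "customer_name": "Bob", "amount": 20.0, "changed_at": "2007-02-10"},
    {"customer_id": "C1", "customer_name": "Ann", "amount": 5.0, "changed_at": "2007-02-12"},
]

# Initial load: all data up till now (here only the first row exists yet).
watermark = incremental_load(dw, source[:1], last_load="")
# Incremental update: only the rows changed since the previous load are moved.
watermark = incremental_load(dw, source, last_load=watermark)
```

The second call skips the row that was already loaded, adds the new customer C2 to the dimension, and only then inserts the two new facts.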
Changed Data Capture
  Messages
    Applications insert messages in a queue at updates
    + Works for all types of updates and systems
    - Operational applications must be changed + operational overhead
  DB triggers
    Triggers execute actions on INSERT/UPDATE/DELETE
    + Operational applications need not be changed
    + Enables real-time update of DW
    - Operational overhead
  Replication based on DB log
    Find changes directly in DB log which is written anyway
    + Operational applications need not be changed
    + No operational overhead
    - Not possible in some DBMS

Common Transformations
  Data type conversions
    EBCDIC -> ASCII/Unicode
    String manipulations
    Date/time format conversions
  Normalization/denormalization
    To the desired DW format
    Depending on source format
  Building keys
    Table matches production keys to surrogate DW keys
    Correct handling of history - especially for total reload
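
The DB trigger variant of changed data capture can be sketched as below. This is an illustration under assumptions: SQLite stands in for the operational DBMS, and the table customer_changes, the trigger names, and the helper extract_changes are hypothetical.

```python
import sqlite3

src = sqlite3.connect(":memory:")
src.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);

    -- Hypothetical change table filled by the triggers below.
    CREATE TABLE customer_changes (
        change_id INTEGER PRIMARY KEY AUTOINCREMENT,
        op        TEXT,      -- 'I', 'U' or 'D'
        id        INTEGER,
        name      TEXT
    );

    CREATE TRIGGER customer_ins AFTER INSERT ON customer BEGIN
        INSERT INTO customer_changes (op, id, name) VALUES ('I', NEW.id, NEW.name);
    END;
    CREATE TRIGGER customer_upd AFTER UPDATE ON customer BEGIN
        INSERT INTO customer_changes (op, id, name) VALUES ('U', NEW.id, NEW.name);
    END;
    CREATE TRIGGER customer_del AFTER DELETE ON customer BEGIN
        INSERT INTO customer_changes (op, id, name) VALUES ('D', OLD.id, OLD.name);
    END;
""")

# The operational application runs unchanged; the triggers record each change.
src.execute("INSERT INTO customer VALUES (1, 'Ann')")
src.execute("UPDATE customer SET name = 'Anne' WHERE id = 1")
src.execute("DELETE FROM customer WHERE id = 1")
src.commit()

def extract_changes(conn, after_change_id=0):
    """Return the changes recorded after the given change_id, oldest first."""
    return conn.execute(
        "SELECT change_id, op, id, name FROM customer_changes "
        "WHERE change_id > ? ORDER BY change_id",
        (after_change_id,),
    ).fetchall()

changes = extract_changes(src)
```

The ETL job would remember the highest change_id it has processed and pass it as after_change_id on the next run. The extra insert per operation is the operational overhead listed as the drawback of this approach.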
Data almost never has decent quality
  Data in DW must be:
    Precise
      DW data must match known numbers - or explanation needed
    Complete
      DW has all relevant data and the users know
    Consistent
      No contradictory data: aggregates fit with detail data
    Unique
      The same thing is called the same and has the same key (customers)
    Timely
      Data is updated frequently enough and the users know when

BI does not work on raw data
  Pre-processing necessary for BI analysis
  Handle inconsistent data formats
    Spellings, codings, ...
  Remove unnecessary attributes
    Production keys, comments, ...
  Replace codes with text (Why?)
    City name instead of ZIP code, ...
  Combine data from multiple sources with common key
    E.g., customer data from customer address, customer name, ...
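
The pre-processing steps listed above can be sketched in a few lines. Everything here is assumed for illustration: the lookup tables ZIP_TO_CITY and GENDER_CODES, the field names, and the function preprocess are hypothetical, and real reference data would replace the hard-coded dictionaries.

```python
# Hypothetical lookup tables; in practice these would come from reference data.
ZIP_TO_CITY = {"9000": "Aalborg", "8000": "Aarhus"}
GENDER_CODES = {"m": "Male", "male": "Male", "f": "Female", "female": "Female"}

def preprocess(names, addresses):
    """Combine two sources on their common key (customer_id) and clean the result."""
    cleaned = []
    for cid, name_row in names.items():
        addr = addresses.get(cid, {})
        cleaned.append({
            "customer_id": cid,
            # Inconsistent spellings: collapse whitespace, normalize capitalization.
            "name": " ".join(name_row["name"].split()).title(),
            # Inconsistent codings: map all variants to one spelling.
            "gender": GENDER_CODES.get(name_row["gender"].strip().lower(), "Unknown"),
            # Replace codes with text: city name instead of ZIP code.
            "city": ZIP_TO_CITY.get(addr.get("zip"), "Unknown"),
            # Unnecessary attributes such as "comment" are simply not copied.
        })
    return cleaned

# Made-up input from two sources that share the customer_id key.
names = {
    "C1": {"name": "  ann   JENSEN ", "gender": "F", "comment": "called twice"},
    "C2": {"name": "bob hansen", "gender": "male", "comment": ""},
}
addresses = {"C1": {"zip": "9000"}}

rows = preprocess(names, addresses)
```

Customer C2 has no address record, so its city falls back to "Unknown" rather than being left empty.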
Types Of Cleansing
  Conversion and normalization
    Text coding, date formats, etc.
    Most common type of cleansing
  Special-purpose cleansing
    Normalize spellings of names, addresses, etc.
    Remove duplicates, e.g., duplicate customers
  Domain-independent cleansing
    Approximate, fuzzy joins on records from different sources
  Rule-based cleansing
    User-specified rules, if-then style
    Automatic rules: use data mining to find patterns in data
      Guess missing sales person based on customer and item

Cleansing
  Mark facts with Data Status dimension
    Normal, abnormal, outside bounds, impossible, ...
    Facts can be taken in/out of analyses
  Uniform treatment of NULL
    Use explicit NULL value rather than special value (0, -1, ...)
    Use NULLs only for measure values (estimates instead?)
    Use special dimension keys for NULL dimension values
      Avoid problems in joins, since NULL is not equal to NULL
  Mark facts with changed status
    New customer, Customer about to cancel contract, ...
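
The point about special dimension keys for NULL dimension values can be demonstrated with a small join. This is a sketch under assumptions: the tables dim_salesperson, raw_sales and fact_sales and the reserved key UNKNOWN_KEY are hypothetical, and SQLite stands in for the DW.

```python
import sqlite3

UNKNOWN_KEY = 0  # hypothetical surrogate key reserved for "unknown sales person"

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE dim_salesperson (salesperson_key INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO dim_salesperson VALUES (0, 'Unknown');
    INSERT INTO dim_salesperson VALUES (1, 'Ann');

    -- Raw facts as extracted: the sales person is missing on one row.
    CREATE TABLE raw_sales (salesperson_key INTEGER, amount REAL);
    INSERT INTO raw_sales VALUES (1, 100.0);
    INSERT INTO raw_sales VALUES (NULL, 40.0);

    CREATE TABLE fact_sales (salesperson_key INTEGER NOT NULL, amount REAL);
""")

join_sql = """
    SELECT d.name, SUM(f.amount)
    FROM {facts} AS f
    JOIN dim_salesperson AS d ON f.salesperson_key = d.salesperson_key
    GROUP BY d.name
    ORDER BY d.name
"""

# Joining the raw facts silently drops the row whose key is NULL,
# because NULL = anything is never true in SQL.
before = db.execute(join_sql.format(facts="raw_sales")).fetchall()

# Cleansing step: replace NULL dimension keys by the special key.
db.execute(
    "INSERT INTO fact_sales "
    "SELECT COALESCE(salesperson_key, ?), amount FROM raw_sales",
    (UNKNOWN_KEY,),
)
after = db.execute(join_sql.format(facts="fact_sales")).fetchall()
```

Here before contains only Ann's 100.0, while after also reports 40.0 under "Unknown", so the totals in the analysis match the detail data again.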
Load
  Relationships in the data
    Referential integrity and data consistency must be ensured (Why?)
    Can be done by loader
  Aggregates
    Can be built and loaded at the same time as the detail data
  Load tuning
    Load without log
    Sort load file first
    Make only simple transformations in loader
    Use loader facilities for building aggregates
  Should DW be on-line 24*7?
    Use partitions or several sets of tables (like MS Analysis)

ETL Tools
  ETL tools from the big vendors
    Oracle Warehouse Builder
    IBM DB2 Warehouse Manager
    Microsoft Integration Services
  Offers much functionality at a reasonable price
    Data modeling
    ETL code generation
    Scheduling DW jobs
  The best tool does not exist
    Choose based on your own needs
    Check first if the standard tools from the big vendors are OK
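
Two of the load tuning hints, sorting the load data first and building aggregates at the same time as the detail data, can be sketched as follows. This is only an illustration: a real DBMS bulk loader would do this work itself, and the tables fact_sales and agg_sales_by_day and the function bulk_load are hypothetical names.

```python
import sqlite3
from collections import defaultdict

def bulk_load(dw, rows):
    """Sort the load data, then insert detail rows and aggregates in one transaction."""
    rows = sorted(rows, key=lambda r: (r[0], r[1]))   # sort load file first

    totals = defaultdict(float)
    for day, _product, amount in rows:
        totals[day] += amount                          # aggregate during the same pass

    with dw:                                           # commits once at the end
        dw.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", rows)
        dw.executemany(
            "INSERT INTO agg_sales_by_day VALUES (?, ?)", sorted(totals.items())
        )


dw = sqlite3.connect(":memory:")
dw.executescript("""
    CREATE TABLE fact_sales (day TEXT, product TEXT, amount REAL);
    CREATE TABLE agg_sales_by_day (day TEXT PRIMARY KEY, total REAL);
""")

# Made-up load data, deliberately out of order.
bulk_load(dw, [
    ("2007-03-02", "B", 5.0),
    ("2007-03-01", "A", 10.0),
    ("2007-03-01", "B", 2.5),
])
```

Because detail rows and aggregates are written in the same transaction, the aggregates always fit the detail data, which is the consistency requirement stated earlier.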
Integration Services (IS) Packages
ETL Demo
Summary