Vassiliadis,Simitsis,Skiadopoulos - DOLAP'02 2
Outline
Motivation
Conceptual Model
Instantiation and Specialization Layers
Methodology for the usage of the
conceptual model
Conclusions and Future Work
Extract-Transform-Load (ETL)
[Figure: data flows from the Sources through the DSA to the DW in three stages: Extract, Transform & Clean, Load]
Motivation
Practical necessity
e.g., 80% of the development time in a DW project
In-house development, ad-hoc solutions
Lack of related work
The front end of the DW has monopolized the research
on the conceptual part of DW modeling
Motivation
Early stages of the DW design:
Concepts are still fuzzy and changing
frequently
Lots of interviews with people
No time for a full, clean-cut definition of the
DW and the ETL workflow
Still, we can:
Trace the mapping of the attributes of the
data sources to the attributes of the DW
tables
Trace necessary constraints and transformations
[Figure: S1.A and DW.A, with a PK constraint]
Conceptual Model
Entities of our model:
Concepts
Attributes
Part-of Relationships
Transformations
Serial Composition of Transformations
Provider Relationships
Notes
ETL Constraints
Candidate Relationships
Conceptual Model
Concepts
a name, finite set of attributes
represent an entity in the source database or in the DW
Attributes
same role as in ER/dimensional models
a granular module of information
We do not employ standard UML notation for concepts and attributes, for the
reason that we need to treat attributes as first class citizens of our model
Conceptual Model
Part-of Relationships
finite set of attributes
emphasize the fact that a concept is composed of a set of attributes
Conceptual Model
Example
Source 1
S1.PARTSUPP {PKEY, SUPPKEY, QTY, COST}
Data Warehouse
DW.PARTSUPP {PKEY, SUPPKEY, DATE, QTY, COST}
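As a rough illustration of concepts, attributes and part-of relationships, the PARTSUPP example can be written down in Python. This is only a sketch: the class names `Concept` and `Attribute`, the method `add_part_of` and the `qualified_name` property are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Attribute:
    """A granular module of information, kept as a first-class object."""
    concept: str
    name: str

    @property
    def qualified_name(self) -> str:
        return f"{self.concept}.{self.name}"


@dataclass
class Concept:
    """A named entity of a source database or of the DW."""
    name: str
    attributes: list[Attribute] = field(default_factory=list)

    def add_part_of(self, *names: str) -> "Concept":
        # Each appended attribute stands for one part-of relationship.
        for n in names:
            self.attributes.append(Attribute(self.name, n))
        return self


s1_partsupp = Concept("S1.PARTSUPP").add_part_of("PKEY", "SUPPKEY", "QTY", "COST")
dw_partsupp = Concept("DW.PARTSUPP").add_part_of("PKEY", "SUPPKEY", "DATE", "QTY", "COST")

# DW attributes that have no same-named attribute in the source concept.
source_names = {a.name for a in s1_partsupp.attributes}
unmatched = [a.qualified_name for a in dw_partsupp.attributes if a.name not in source_names]
print(unmatched)  # ['DW.PARTSUPP.DATE']
```

The only DW attribute without a source counterpart is DATE, so it has to be supplied some other way than by copying a source attribute.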
Conceptual Model
[Figure: the concepts S1.PARTSUPP (PKey, SuppKey, Qty, Cost) and DW.PARTSUPP (PKey, SuppKey, Date, Qty, Cost), each linked to its attributes through part-of relationships]
Conceptual Model
Transformations
finite set of input/output attributes, a symbol
abstractions that represent parts, or full modules, of code executing a single task
two categories:
filtering or data cleaning operations
(e.g., foreign key violations)
transformation operations
(e.g., aggregation)
Conceptual Model
Provider Relationships
finite set of input/output attributes, an
appropriate transformation
map a set of input attributes to a set of
output attributes through a relevant
transformation*
[Figure: notation for provider relationships, in 1:1 and N:M variants]
* If the attributes are semantically and physically compatible, no transformation is required
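A provider relationship can be sketched as a small record holding input attributes, output attributes and an optional transformation. The names `Provider`, `Transformation` and `cardinality` are hypothetical. Routing PKEY through SK and COST through NN is an assumed reading of the PARTSUPP example, not something the slides state outright.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Transformation:
    name: str
    symbol: str


@dataclass(frozen=True)
class Provider:
    inputs: tuple[str, ...]
    outputs: tuple[str, ...]
    # None models the footnote case: compatible attributes need no transformation.
    transformation: Optional[Transformation] = None

    @property
    def cardinality(self) -> str:
        return "1:1" if len(self.inputs) == 1 and len(self.outputs) == 1 else "N:M"


SK = Transformation("surrogate key assignment", "SK")
NN = Transformation("not null", "NN")

# Assumed mappings for the PARTSUPP example.
providers = [
    Provider(("S1.PARTSUPP.PKEY",), ("DW.PARTSUPP.PKEY",), SK),
    Provider(("S1.PARTSUPP.SUPPKEY",), ("DW.PARTSUPP.SUPPKEY",)),
    Provider(("S1.PARTSUPP.QTY",), ("DW.PARTSUPP.QTY",)),
    Provider(("S1.PARTSUPP.COST",), ("DW.PARTSUPP.COST",), NN),
]

# Output attributes that are fed directly, without any transformation.
direct = [p.outputs[0] for p in providers if p.transformation is None]
print(direct)  # ['DW.PARTSUPP.SUPPKEY', 'DW.PARTSUPP.QTY']
```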
Conceptual Model
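[Figure: provider relationships from the attributes of S1.PARTSUPP to those of DW.PARTSUPP (PKey, SuppKey, Date, Qty, Cost). The transformations SK, f and NN appear alongside the PKey, Date and Cost rows respectively.]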
Conceptual Model
Notes
informal tags, exactly as in UML modeling
used for:
simple comments explaining
design decisions
explanation of the semantics
of the applied transformation
tracing of runtime constraints
Conceptual Model
[Figure: provider relationships between S1.PARTSUPP and DW.PARTSUPP with the transformations SK, f and NN, annotated with the note "Date = SysDate()"]
Conceptual Model
ETL Constraints
finite set of attributes, a single transformation
express the fact that the data of a certain concept fulfill several requirements
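To make the idea of an ETL constraint concrete, here is a minimal sketch of a primary key check over the rows of a concept. The function name `pk_violations`, the choice of (PKEY, SUPPKEY, DATE) as the key and the sample rows are all assumptions made for illustration.

```python
from collections import Counter


def pk_violations(rows, key_attrs):
    """Return the key values that occur more than once in `rows`."""
    counts = Counter(tuple(r[a] for a in key_attrs) for r in rows)
    return sorted(k for k, c in counts.items() if c > 1)


# Hypothetical DW.PARTSUPP rows; the second row repeats the key of the first.
rows = [
    {"PKEY": 1, "SUPPKEY": 10, "DATE": "2002-11-08", "QTY": 5},
    {"PKEY": 1, "SUPPKEY": 10, "DATE": "2002-11-08", "QTY": 7},
    {"PKEY": 2, "SUPPKEY": 10, "DATE": "2002-11-08", "QTY": 3},
]

violations = pk_violations(rows, ["PKEY", "SUPPKEY", "DATE"])
print(violations)  # [(1, 10, '2002-11-08')]
```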
Conceptual Model
[Figure: provider relationships between S1.PARTSUPP and DW.PARTSUPP with the transformations SK, f and NN, the note "Date = SysDate()", and a PK ETL constraint attached to DW.PARTSUPP]
Conceptual Model
Candidate Relationships
a single candidate concept, a single target concept
used when a certain DW concept is populated by a
finite set of more than one candidate source concepts
Active Candidate Relationship
a certain candidate that has been selected for the
population of the target concept
a specialization of candidate relationships
[Figure: notation. candidate 1 ... candidate n are linked to the target under an {XOR} annotation; the active candidate is marked.]
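The {XOR} semantics of candidate relationships can be sketched as a set of candidates with at most one active member. The class `CandidateSet` and its `activate` method are hypothetical names; the labels `candidate1`, `candidate2` and `target` simply reuse the placeholders from the notation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CandidateSet:
    """Candidate source concepts for one target concept; at most one is active."""
    target: str
    candidates: list[str] = field(default_factory=list)
    active: Optional[str] = None

    def activate(self, candidate: str) -> None:
        if candidate not in self.candidates:
            raise ValueError(f"{candidate!r} is not a candidate for {self.target!r}")
        # XOR: choosing a new active candidate replaces the previous choice.
        self.active = candidate


cs = CandidateSet("target", ["candidate1", "candidate2"])
cs.activate("candidate1")
cs.activate("candidate2")
print(cs.active)  # candidate2
```

Storing a single `active` field, rather than a flag per candidate, means two candidates can never be active at once.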
Conceptual Model
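[Figure: candidate relationships in the example. Annual PartSupp's and Recent PartSupp's are {XOR} candidates for S2.PartSupp; S1.PartSupp and S2.PartSupp feed DW.PartSupp through a union (U). Notes: "Necessary providers: S1 and S2" and "Due to accuracy and small size (< update window)".]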
Conceptual Model
[Figure: the complete example. Concepts: S1.PARTSUPP, S2.PARTSUPP, DW.PARTSUPP (with a PK constraint), and the candidates Annual PartSupp's and Recent PartSupp's ({XOR}). Notes: "Necessary providers: S1 and S2", "{Duration<4h}", "Due to accuracy and small size (< update window)", "Date = SysDate()". Transformations: union (U), aggregation (γ) with SUM(S2.Qty) and SUM(S2.Cost), and functions (f) including $2€ and American to European Date. Edge labels: S2.PKey, S2.SuppKey, S2.Date.]
Instantiation & Specialization
Layers
The key issues:
genericity
identification of a small set of generic constructs to
capture all cases
usability
construction of a ‘palette’ of frequently used types
Instantiation & Specialization
Layers
Metamodel layer
a set of generic entities, able to represent any ETL
scenario
involves classes: Concept, Attribute, Transformation, ETL
Constraint and Relationship
Template layer
a set of ‘built-in’ specializations of the entities of the
Metamodel layer, specifically tailored for the most
frequent elements of ETL scenarios
Schema layer
a specific ETL scenario
all the entities of the Schema layer are instances of the
classes of the Metamodel layer
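One way to picture the three layers is with ordinary classes: a generic Metamodel class, Template classes that specialize it (IsA), and Schema-layer objects that instantiate the templates (InstanceOf). This is only an analogy sketched in Python. The class names and the attribute names passed to the constructors are hypothetical.

```python
class Transformation:
    """Metamodel layer: a generic transformation over input/output attributes."""
    symbol = None

    def __init__(self, inputs, outputs):
        self.inputs = list(inputs)
        self.outputs = list(outputs)


class SurrogateKeyAssignment(Transformation):
    """Template layer: a built-in specialization (IsA Transformation)."""
    symbol = "SK"


class Aggregation(Transformation):
    """Template layer: another built-in specialization."""
    symbol = "γ"


# Schema layer: a specific scenario made of instances of the templates.
scenario = [
    SurrogateKeyAssignment(["S2.PartSupp.PKey"], ["DW.PartSupp.PKey"]),
    Aggregation(["S2.PartSupp.Qty"], ["DW.PartSupp.Qty"]),
]

symbols = [t.symbol for t in scenario]
# Every schema-level entity is also an instance of the metamodel class.
all_metamodel_instances = all(isinstance(t, Transformation) for t in scenario)
print(symbols, all_metamodel_instances)  # ['SK', 'γ'] True
```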
Instantiation & Specialization
Layers
[Figure: the three layers.
Metamodel layer: Concept, Attribute, Transformation, ETL_Constraint, Relationship.
Template layer (IsA the Metamodel classes): Fact Table, Dimension, ER Entity, ER Relationship; $2€, American to European Date, Surrogate Key Assignment, Aggregation; Part Of, Candidate, Serial Composition, Provider.
Schema layer (InstanceOf): S2.PartSupp, DW.PartSupp, Candidate 1, Candidate 2, SK, f, γ.]
Instantiation & Specialization
Layers
Template layer
Four groups of logical transformations
Filters
Unary transformations
Binary transformations
Composite transformations
Two groups of physical transformations
Transfer operations
File operations
Instantiation & Specialization
Layers
Filters
Selection (σ)
Not null (NN)
Primary key violation (PK)
Foreign key violation (FK)
Unique value (UN)
Domain mismatch (DM)
Unary transformations
Push
Aggregation (γ)
Projection (π)
Function application (f)
Surrogate key assignment (SK)
Tuple normalization (N)
Tuple denormalization (DN)
Binary transformations
Union (U)
Join (⋈)
Diff (Δ)
Update Detection (ΔUPD)
Composite transformations
Slowly changing dimension (Type 1,2,3) (SDC-1/2/3)
Format mismatch (FM)
Data type conversion (DTC)
Switch (σ*)
Extended union (U)
Transfer operations
Ftp (FTP)
Compress/Decompress (Z/dZ)
Encrypt/Decrypt (Cr/dCr)
File operations
EBCDIC to ASCII conversion (EB2AS)
Sort file (Sort)
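To give a feel for what a few palette entries might do at runtime, the sketch below implements a Not null filter (NN), a Function application (f) and an Aggregation (γ) over rows held as dictionaries. The function names, their exact semantics and the sample rows are assumptions. The slides define the templates only at the conceptual level.

```python
from collections import defaultdict


def not_null(rows, attr):
    """NN filter: keep only rows whose `attr` is present and not None."""
    return [r for r in rows if r.get(attr) is not None]


def apply_function(rows, target, fn):
    """f: compute `target` for every row from the row itself."""
    return [{**r, target: fn(r)} for r in rows]


def aggregate_sum(rows, group_by, attr):
    """γ: sum `attr` per distinct combination of the `group_by` attributes."""
    totals = defaultdict(int)
    for r in rows:
        totals[tuple(r[g] for g in group_by)] += r[attr]
    return [dict(zip(group_by, key), **{attr: total}) for key, total in totals.items()]


# Hypothetical source rows.
rows = [
    {"PKEY": 1, "SUPPKEY": 10, "QTY": 5, "COST": 2.0},
    {"PKEY": 1, "SUPPKEY": 10, "QTY": 3, "COST": None},
    {"PKEY": 2, "SUPPKEY": 10, "QTY": 7, "COST": 1.5},
]

clean = not_null(rows, "COST")
dated = apply_function(clean, "DATE", lambda r: "2002-11-08")
totals = aggregate_sum(rows, ["PKEY", "SUPPKEY"], "QTY")
print(len(clean), dated[0]["DATE"], totals)
```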
Methodology
Step 1
Identification of the proper data stores
Step 2
Candidates and active candidates for the
involved data stores
Step 3
Attribute mapping between the providers and
the consumers
Step 4
Annotating the diagram with runtime
constraints
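The four steps can be treated as a checklist over a scenario description. The sketch below reports the first step that still has no content. The dictionary keys, the `next_step` helper and the sample scenario are hypothetical and serve only to show the ordering of the steps.

```python
STEPS = [
    ("data_stores", "Step 1: identify the proper data stores"),
    ("candidates", "Step 2: define candidates and active candidates"),
    ("mappings", "Step 3: map attributes between providers and consumers"),
    ("runtime_notes", "Step 4: annotate the diagram with runtime constraints"),
]


def next_step(scenario):
    """Return the label of the first step with nothing recorded yet, or None."""
    for key, label in STEPS:
        if not scenario.get(key):
            return label
    return None


# A scenario for which only the first two steps have been carried out.
scenario = {
    "data_stores": ["S1.PARTSUPP", "S2.PARTSUPP", "DW.PARTSUPP"],
    "candidates": {"target": ["candidate1", "candidate2"]},
    "mappings": [],
}

pending = next_step(scenario)
print(pending)  # Step 3: map attributes between providers and consumers
```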
Conclusions
Our contributions lie in:
The proposal of a novel conceptual model, customized for tracing inter-attribute relationships and the respective ETL activities
A customizable and extensible construction
The introduction of a 'palette' of frequently used ETL activities
On-going/Future Work
The Arktos II project is aimed towards the
Conceptual modeling
Logical modeling
Optimization
What-if analysis
of ETL scenarios
http://www.dblab.ece.ntua.gr/~pvassil/projects/arktos_II
Thank you
Back-up slides
Logical Model [DMDW’02]
[Figure: logical-level workflow for the example, from Sources to DW. Data stores: S1_PARTSUPP, S2_PARTSUPP, DS.PS_NEW1, DS.PS_NEW2, DW.PARTSUPP, TIME. Activities: FTP1, FTP2, Aggregate1, Aggregate2, V1, V2. Activity annotations include DS.PS_NEW1.PKEY, DS.PS_OLD1.PKEY, DS.PS_NEW2.PKEY, DS.PS_OLD2.PKEY, DS.PS1.PKEY, DS.PS2.PKEY, LOOKUP_PS.SKEY, SUPPKEY, SUPPKEY=1, SUPPKEY=2, COST, DATE, DATE=SYSDATE and QTY>0.]
Conceptual Model
[Figure: notation legend for concept, attribute, transformation, Note, ETL_constraint, part of, and candidate relationships (candidate 1 ... candidate n, {XOR}, target, active candidate)]
The lifecycle of a Data Warehouse
and its ETL processes
[Figure: lifecycle diagram. Phases: Reverse Engineering of Sources & Requirements Collection; Logical Design; Tuning – Full Activity Description; Software Construction; Administration of DW. Artifacts: Conceptual Model for DW, Sources & Activities; Logical Model for DW, Sources & Activities; Physical Model for DW, Sources & Activities; Software & SW Metrics; Metrics.]
Conceptual Model
[Figure: UML class diagram of the metamodel. Metaclasses: Concept, Attribute, Transformation, ETL_Constraint, PartOf, Provider, Serial Composition, Relationship, Candidate, Active Candidate; a Tag element also appears. Attribute and role labels visible in the diagram: name, schema, symbol, content, attributes, transformation, input, output, initiating, consequent, candidate, target.]
Conceptual Model
General Notes
It is not a process/workflow model
It is orthogonal to the conceptual models
which are available for the modeling of DW
star schemata
It is specifically tailored for the back end of the
DW
Any of the proposals for the DW front end can
be combined with our approach
Conceptual Model
Serial Composition of Transformations
a single initiating transformation, a single subsequent transformation
combine several transformations in a single provider relationship
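Serial composition behaves much like function composition: the output of the initiating transformation becomes the input of the subsequent one. The sketch below assumes that transformations are plain Python callables and that American dates are written MM/DD/YYYY and European dates DD/MM/YYYY. Both assumptions, and the helper names, are illustrative only.

```python
from functools import reduce


def compose(*transformations):
    """Chain transformations left to right into one callable."""
    def composed(value):
        return reduce(lambda acc, t: t(acc), transformations, value)
    return composed


def american_to_european_date(text):
    """Assumed formats: 'MM/DD/YYYY' in, 'DD/MM/YYYY' out."""
    month, day, year = text.split("/")
    return f"{day}/{month}/{year}"


# Initiating transformation: trim whitespace; subsequent: convert the date.
clean_date = compose(str.strip, american_to_european_date)
print(clean_date(" 11/08/2002 "))  # 08/11/2002
```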