
Conceptual Modeling for ETL Processes

Panos Vassiliadis, Alkis Simitsis, Spiros Skiadopoulos
{pvassil,asimi,spiros}@dblab.ece.ntua.gr

National Technical University of Athens
KDBS Laboratory
http://www.dbnet.ece.ntua.gr

General Idea
• The problem:
  - The conceptual part of the definition of ETL processes in the early stages of a DW project
• The key idea:
  - The mapping of the attributes of the data sources to the attributes of the DW tables

Vassiliadis,Simitsis,Skiadopoulos - DOLAP'02 2
Outline
• Motivation
• Conceptual Model
• Instantiation and Specialization Layers
• Methodology for the usage of the conceptual model
• Conclusions and Future Work

Extract-Transform-Load (ETL)
[Figure: the ETL pipeline. Stages: Extract, Transform & Clean, Load. Data stores: Sources, DSA, DW.]

Motivation
• Practical necessity
  - e.g., ETL takes up 80% of the development time in a DW project
  - In-house development, ad-hoc solutions
• Lack of related work
  - The front end of the DW has monopolized the research on the conceptual part of DW modeling

Thus, the design, development and deployment of ETL processes need modeling, design and methodological foundations

Motivation
• Early stages of the DW design:
  - Concepts are still fuzzy and changing frequently
  - Lots of interviews with people
  - No time for a full, clean-cut definition of the DW and the ETL workflow
• Still, we can:
  - Trace the mapping of the attributes of the data sources to the attributes of the DW tables
  - Trace necessary constraints and transformations for the ETL process
[Figure: attribute S1.A mapped to attribute DW.A, annotated with a PK constraint]

Conceptual Model
• Entities of our model:
  - Concepts
  - Attributes
  - Part-of Relationships
  - Transformations
  - Serial Composition of Transformations
  - Provider Relationships
  - Notes
  - ETL Constraints
  - Candidate Relationships

Conceptual Model
• Concepts
  - a name, finite set of attributes
  - represent an entity in the source database or in the DW
• Attributes
  - same role as in ER/dimensional models
  - a granular module of information
[Figure: notation symbols for "concept" and "attribute"]

We do not employ standard UML notation for concepts and attributes, because we need to treat attributes as first-class citizens of our model

Conceptual Model
• Part-of Relationships
  - finite set of attributes
  - emphasize the fact that a concept is composed of a set of attributes
[Figure: notation symbol for "part of"]

Conceptual Model
• Example
  - Source 1
    - S1.PARTSUPP {PKEY, SUPPKEY, QTY, COST}
  - Data Warehouse
    - DW.PARTSUPP {PKEY, SUPPKEY, DATE, QTY, COST}

Conceptual Model
[Figure: the concepts S1.PARTSUPP and DW.PARTSUPP with their attributes]

S1.PARTSUPP    DW.PARTSUPP
PKey           PKey
SuppKey        SuppKey
               Date
Qty            Qty
Cost           Cost

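The PARTSUPP example above can be sketched in code as follows. This is only an illustrative sketch: the class names `Concept` and `Attribute` mirror the entities of the model, but the methods `add` and `attribute` are assumptions introduced here and are not part of the original proposal.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Attribute:
    """A granular piece of information, kept as a first-class object."""
    concept: str
    name: str

    def __str__(self) -> str:
        return f"{self.concept}.{self.name}"


@dataclass
class Concept:
    """A named entity of a source or of the DW, with its attributes."""
    name: str
    attributes: List[Attribute] = field(default_factory=list)

    def add(self, *names: str) -> "Concept":
        # Each appended attribute stands for one part-of relationship.
        for attribute_name in names:
            self.attributes.append(Attribute(self.name, attribute_name))
        return self

    def attribute(self, name: str) -> Attribute:
        for candidate in self.attributes:
            if candidate.name == name:
                return candidate
        raise KeyError(f"{self.name} has no attribute {name}")


s1_partsupp = Concept("S1.PARTSUPP").add("PKEY", "SUPPKEY", "QTY", "COST")
dw_partsupp = Concept("DW.PARTSUPP").add("PKEY", "SUPPKEY", "DATE", "QTY", "COST")

# Attributes of the DW concept that have no same-named source attribute.
source_names = {a.name for a in s1_partsupp.attributes}
missing = [a.name for a in dw_partsupp.attributes if a.name not in source_names]
print(missing)
```

Listing the DW attributes without a same-named source attribute is one simple way to spot where a transformation has to supply the value.
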
Conceptual Model
• Transformations
  - finite set of input/output attributes, a symbol
  - abstractions that represent parts, or full modules of code, executing a single task
  - two categories:
    - filtering or data cleaning operations (e.g., foreign key violations)
    - transformation operations (e.g., aggregation)
[Figure: notation symbol for "transformation"]

Conceptual Model
• Provider Relationships
  - finite set of input/output attributes, an appropriate transformation
  - map a set of input attributes to a set of output attributes through a relevant transformation*
[Figure: notation for 1:1 and N:M provider relationships]

* If the attributes are semantically and physically compatible, no transformation is required

Conceptual Model
[Figure: provider relationships from S1.PARTSUPP to DW.PARTSUPP]

S1.PARTSUPP    Transformation    DW.PARTSUPP
PKey           SK                PKey
SuppKey                          SuppKey
               f                 Date
Qty                              Qty
Cost           NN                Cost

Conceptual Model
• Notes
  - informal tags, exactly as in UML modeling
  - used for:
    - simple comments explaining design decisions
    - explanation of the semantics of the applied transformation
    - tracing of runtime constraints
[Figure: notation symbol for "Note"]

Conceptual Model
[Figure: provider relationships from S1.PARTSUPP to DW.PARTSUPP, with a note attached to f]

S1.PARTSUPP    Transformation    DW.PARTSUPP
PKey           SK                PKey
SuppKey                          SuppKey
               f                 Date
Qty                              Qty
Cost           NN                Cost

Note on f: Date = SysDate()

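A minimal sketch of how the provider relationships of this diagram could be executed, assuming rows are plain dictionaries. The `Provider` tuple, the `SURROGATE_KEYS` lookup table and the function names are hypothetical and serve only to illustrate SK, f (Date = SysDate()) and NN; the conceptual model itself does not prescribe any implementation.

```python
from datetime import date
from typing import Callable, Dict, List, NamedTuple, Optional, Tuple

Row = Dict[str, object]

# Hypothetical lookup table: production key -> surrogate key.
SURROGATE_KEYS = {"P-10": 1, "P-20": 2}


class Provider(NamedTuple):
    """Maps input attributes to one output attribute."""
    inputs: Tuple[str, ...]
    output: str
    transform: Optional[Callable[..., object]] = None  # None: direct mapping


def sk(pkey: str) -> int:
    """SK: surrogate key assignment through the lookup table."""
    return SURROGATE_KEYS[pkey]


def sysdate() -> date:
    """f: the note on the diagram says Date = SysDate()."""
    return date.today()


PROVIDERS = [
    Provider(("PKey",), "PKey", sk),
    Provider(("SuppKey",), "SuppKey"),
    Provider((), "Date", sysdate),
    Provider(("Qty",), "Qty"),
    Provider(("Cost",), "Cost"),
]


def populate(source_rows: List[Row]) -> List[Row]:
    """Apply the NN filter on Cost, then every provider relationship."""
    target = []
    for row in source_rows:
        if row["Cost"] is None:  # NN: drop rows with a null Cost
            continue
        out: Row = {}
        for p in PROVIDERS:
            values = [row[name] for name in p.inputs]
            out[p.output] = p.transform(*values) if p.transform else values[0]
        target.append(out)
    return target


rows = [
    {"PKey": "P-10", "SuppKey": 7, "Qty": 5, "Cost": 9.5},
    {"PKey": "P-20", "SuppKey": 8, "Qty": 1, "Cost": None},
]
loaded = populate(rows)
print(loaded)
```

Providers without a transformation copy the value unchanged, which corresponds to the footnote that compatible attributes need no transformation.
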
Conceptual Model
• ETL Constraints
  - finite set of attributes, a single transformation
  - express the fact that the data of a certain concept fulfill several requirements
[Figure: notation symbol for "ETL_constraint"]

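Conceptual Model
[Figure: provider relationships from S1.PARTSUPP to DW.PARTSUPP, with a PK ETL constraint on DW.PARTSUPP]

S1.PARTSUPP    Transformation    DW.PARTSUPP
PKey           SK                PKey
SuppKey                          SuppKey
               f                 Date
Qty                              Qty
Cost           NN                Cost

Note on f: Date = SysDate()
ETL constraint on DW.PARTSUPP: PK
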
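The PK constraint on DW.PARTSUPP could be checked along the following lines. The slide does not say which attributes form the primary key, so the key (PKey, SuppKey, Date) used here is an assumption, and `split_by_primary_key` is a hypothetical helper name.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]

# Assumed primary key of DW.PARTSUPP (not stated on the slide).
ASSUMED_KEY = ("PKey", "SuppKey", "Date")


def split_by_primary_key(
    rows: List[Row], key: Sequence[str] = ASSUMED_KEY
) -> Tuple[List[Row], List[Row]]:
    """Return (accepted, rejected): rows with a null or repeated key are rejected."""
    seen = set()
    accepted: List[Row] = []
    rejected: List[Row] = []
    for row in rows:
        value = tuple(row[attribute] for attribute in key)
        if None in value or value in seen:
            rejected.append(row)
        else:
            seen.add(value)
            accepted.append(row)
    return accepted, rejected


candidate_rows = [
    {"PKey": 1, "SuppKey": 7, "Date": "2002-11-08", "Qty": 5, "Cost": 9.5},
    {"PKey": 1, "SuppKey": 7, "Date": "2002-11-08", "Qty": 6, "Cost": 9.0},
    {"PKey": 2, "SuppKey": 7, "Date": "2002-11-08", "Qty": 1, "Cost": 3.0},
]
accepted, rejected = split_by_primary_key(candidate_rows)
print(len(accepted), len(rejected))
```
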
Conceptual Model
• Candidate Relationships
  - a single candidate concept, a single target concept
  - used when a certain DW concept can be populated by more than one candidate source concept
• Active Candidate Relationship
  - a certain candidate that has been selected for the population of the target concept
  - a specialization of candidate relationships
[Figure: candidate1 ... candidaten linked to the target under an {XOR} constraint; one link is marked "active candidate"]

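One way to picture the {XOR} constraint is a structure with a single slot for the active candidate. The sketch below is an assumption made for illustration; the class `CandidateRelationships` and its methods are hypothetical names, and the concept names are taken from the figure legend.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class CandidateRelationships:
    """All candidates of one target concept, with at most one active."""
    target: str
    candidates: Tuple[str, ...]
    # A single slot: choosing one candidate excludes the others (XOR).
    active: Optional[str] = None

    def activate(self, candidate: str) -> None:
        if candidate not in self.candidates:
            raise ValueError(f"{candidate} is not a candidate for {self.target}")
        self.active = candidate

    def providers(self) -> List[str]:
        """Concepts that actually populate the target."""
        return [] if self.active is None else [self.active]


relationships = CandidateRelationships("target", ("candidate1", "candidate2"))
relationships.activate("candidate2")
print(relationships.providers())
```
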
Conceptual Model
[Figure: S1.PartSupp and S2.PartSupp feed DW.PartSupp through a union (U). The candidates "Annual PartSupp's" and "Recent PartSupp's" are attached under an {XOR} constraint.]
Notes in the figure:
• Necessary providers: S1 and S2
• Due to accuracy and small size (< update window)

Conceptual Model
[Figure: the complete example diagram, relating S1.PARTSUPP and S2.PARTSUPP to DW.PARTSUPP]
Recoverable elements of the figure:
• Concepts: S1.PARTSUPP, S2.PARTSUPP, DW.PARTSUPP, and the candidates "Annual PartSupp's" and "Recent PartSupp's" under {XOR}
• Attributes: PKey, SuppKey, Date, Qty, Cost, Department
• Transformations: SK, f, NN, γ, U, $2€, American to European Date
• Labels on the aggregation (γ): S2.PKey, S2.SuppKey, S2.Date, SUM(S2.Qty), SUM(S2.Cost)
• ETL constraint: PK on DW.PARTSUPP
• Notes: "Necessary providers: S1 and S2", "{Duration<4h}", "Due to accuracy and small size (< update window)", "Date = SysDate()"

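The aggregation (γ) and the date conversion of the figure could look roughly like this in code. The sketch assumes that γ groups by PKey, SuppKey and Date and sums Qty and Cost, as the labels suggest, and that American dates are MM/DD/YYYY strings; the order of the two steps and the function names are assumptions.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, List

Row = Dict[str, object]


def american_to_european(value: str) -> str:
    """Convert MM/DD/YYYY to DD/MM/YYYY (both formats are assumptions)."""
    return datetime.strptime(value, "%m/%d/%Y").strftime("%d/%m/%Y")


def aggregate(rows: List[Row]) -> List[Row]:
    """γ: group by (PKey, SuppKey, Date) and compute SUM(Qty), SUM(Cost)."""
    totals = defaultdict(lambda: [0, 0.0])
    for row in rows:
        key = (row["PKey"], row["SuppKey"], row["Date"])
        totals[key][0] += row["Qty"]
        totals[key][1] += row["Cost"]
    return [
        {"PKey": k[0], "SuppKey": k[1], "Date": k[2], "Qty": qty, "Cost": cost}
        for k, (qty, cost) in totals.items()
    ]


s2_rows = [
    {"PKey": 1, "SuppKey": 7, "Date": "12/31/2002", "Department": "A", "Qty": 2, "Cost": 1.5},
    {"PKey": 1, "SuppKey": 7, "Date": "12/31/2002", "Department": "B", "Qty": 3, "Cost": 2.5},
    {"PKey": 2, "SuppKey": 7, "Date": "01/15/2003", "Department": "A", "Qty": 4, "Cost": 8.0},
]
converted = [dict(r, Date=american_to_european(r["Date"])) for r in s2_rows]
result = aggregate(converted)
print(result)
```

The Department attribute disappears in the result because it is not part of the grouping key.
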
Instantiation & Specialization Layers
• The key issues:
  - genericity
    - identification of a small set of generic constructs to capture all cases
  - usability
    - construction of a 'palette' of frequently used types

Instantiation & Specialization Layers
• Metamodel layer
  - a set of generic entities, able to represent any ETL scenario
  - involves classes: Concept, Attribute, Transformation, ETL Constraint and Relationship
• Template layer
  - a set of 'built-in' specializations of the entities of the Metamodel layer, specifically tailored for the most frequent elements of ETL scenarios
• Schema layer
  - a specific ETL scenario
  - all the entities of the Schema layer are instances of the classes of the Metamodel layer

Instantiation & Specialization Layers
[Figure: the three layers, connected by IsA and InstanceOf links]
• Metamodel layer: Concept, Attribute, Transformation, ETL_Constraint, Relationship
• Template layer (IsA links to the Metamodel layer):
  - concepts: Fact Table, Dimension, ER Entity, ER Relationship
  - transformations: $2€, American to European Date, Surrogate Key Assignment, Aggregation
  - relationships: Part Of, Candidate, Serial Composition, Provider
• Schema layer (InstanceOf links): S2.PartSupp, DW.PartSupp, Candidate 1, Candidate 2, and the transformations SK, f, γ

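The three layers map naturally onto classes, subclasses and objects. The sketch below is only an analogy under that assumption: the class names follow the figure, while treating DW.PartSupp as a fact table and the helper `metamodel_class` are choices made here for illustration.

```python
class Concept:
    """Metamodel layer: any entity of a source or of the DW."""

    def __init__(self, name: str) -> None:
        self.name = name


class Transformation:
    """Metamodel layer: any ETL task."""

    symbol = ""

    def __init__(self, name: str) -> None:
        self.name = name


# Template layer: built-in specializations (IsA).
class FactTable(Concept):
    pass


class SurrogateKeyAssignment(Transformation):
    symbol = "SK"


class Aggregation(Transformation):
    symbol = "γ"


# Schema layer: one specific scenario (InstanceOf).
scenario = [
    Concept("S2.PartSupp"),
    FactTable("DW.PartSupp"),  # assumption: DW.PartSupp modeled as a fact table
    SurrogateKeyAssignment("SK"),
    Aggregation("γ"),
]


def metamodel_class(entity: object) -> str:
    """Name of the Metamodel class that the entity is an instance of."""
    for cls in (Concept, Transformation):
        if isinstance(entity, cls):
            return cls.__name__
    raise TypeError(f"{entity!r} does not belong to the metamodel")


print([metamodel_class(entity) for entity in scenario])
```

Because template classes inherit from metamodel classes, every schema-level object remains an instance of a Metamodel class, as stated for the Schema layer.
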
Instantiation & Specialization Layers
• Template layer
  - Four groups of logical transformations
    - Filters
    - Unary transformations
    - Binary transformations
    - Composite transformations
  - Two groups of physical transformations
    - Transfer operations
    - File operations

Instantiation & Specialization Layers
• Filters
  - Selection (σ)
  - Not null (NN)
  - Primary key violation (PK)
  - Foreign key violation (FK)
  - Unique value (UN)
  - Domain mismatch (DM)
• Unary transformations
  - Push
  - Aggregation (γ)
  - Projection (π)
  - Function application (f)
  - Surrogate key assignment (SK)
  - Tuple normalization (N)
  - Tuple denormalization (DN)
• Binary transformations
  - Union (U)
  - Join (⋈)
  - Diff (Δ)
  - Update Detection (ΔUPD)
• Composite transformations
  - Slowly changing dimension (Type 1,2,3) (SDC-1/2/3)
  - Format mismatch (FM)
  - Data type conversion (DTC)
  - Switch (σ*)
  - Extended union (U)
• File operations
  - EBCDIC to ASCII conversion (EB2AS)
  - Sort file (Sort)
• Transfer operations
  - Ftp (FTP)
  - Compress/Decompress (Z/dZ)
  - Encrypt/Decrypt (Cr/dCr)

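As an illustration of two binary transformations from the palette, the sketch below compares a new snapshot with an old one. The slides do not define the exact semantics, so this assumes that Diff (Δ) returns the rows whose key is absent from the old snapshot and that Update Detection (ΔUPD) returns the rows whose key exists but whose values changed; all function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]


def key_of(row: Row, key: Sequence[str]) -> Tuple[object, ...]:
    return tuple(row[attribute] for attribute in key)


def diff(new_rows: List[Row], old_rows: List[Row], key: Sequence[str]) -> List[Row]:
    """Δ (assumed semantics): rows of the new snapshot with a previously unseen key."""
    old_keys = {key_of(row, key) for row in old_rows}
    return [row for row in new_rows if key_of(row, key) not in old_keys]


def update_detection(
    new_rows: List[Row], old_rows: List[Row], key: Sequence[str]
) -> List[Row]:
    """ΔUPD (assumed semantics): rows whose key exists but whose values differ."""
    old_by_key = {key_of(row, key): row for row in old_rows}
    changed = []
    for row in new_rows:
        previous = old_by_key.get(key_of(row, key))
        if previous is not None and previous != row:
            changed.append(row)
    return changed


old_snapshot = [{"PKEY": 1, "QTY": 5}, {"PKEY": 2, "QTY": 3}]
new_snapshot = [{"PKEY": 1, "QTY": 5}, {"PKEY": 2, "QTY": 4}, {"PKEY": 3, "QTY": 9}]

inserted = diff(new_snapshot, old_snapshot, ("PKEY",))
updated = update_detection(new_snapshot, old_snapshot, ("PKEY",))
print(inserted, updated)
```
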
Methodology
• Step 1
  - Identification of the proper data stores
• Step 2
  - Candidates and active candidates for the involved data stores
• Step 3
  - Attribute mapping between the providers and the consumers
• Step 4
  - Annotating the diagram with runtime constraints

Conclusions
• Our contributions lie in:
  - The proposal of a novel conceptual model which is customized for the tracing of inter-attribute relationships and the respective ETL activities
  - A customizable and extensible construction
  - The introduction of a 'palette' of frequently used ETL activities

On-going/Future Work
The Arktos II project is aimed towards the
• Conceptual modeling
• Logical modeling
• Optimization
• What-if analysis
of ETL scenarios

http://www.dblab.ece.ntua.gr/~pvassil/projects/arktos_II

Thank you

Back-up slides

Logical Model [DMDW'02]
[Figure: logical-level ETL workflow from the Sources through the DSA to the DW]
Recoverable elements of the figure:
• Sources: S1_PARTSUPP, S2_PARTSUPP
• DSA, first flow: FTP1, DS.PS_NEW1, DS.PS_OLD1, DS.PS1, and the activities
  - DIFF1 (DS.PS_NEW1.PKEY, DS.PS_OLD1.PKEY)
  - Add_SPK1 (SUPPKEY=1)
  - SK1 (DS.PS1.PKEY, LOOKUP_PS.SKEY, SUPPKEY)
  - $2€ (COST)
  - A2EDate (DATE)
• DSA, second flow: FTP2, DS.PS_NEW2, DS.PS_OLD2, DS.PS2, and the activities
  - DIFF2 (DS.PS_NEW2.PKEY, DS.PS_OLD2.PKEY)
  - Add_SPK2 (SUPPKEY=2)
  - SK2 (DS.PS2.PKEY, LOOKUP_PS.SKEY, SUPPKEY)
  - NotNULL (COST)
  - AddDate (DATE=SYSDATE)
  - CheckQTY (QTY>0)
• Rejected rows of several activities are written to a Log
• A union (U) precedes DW.PARTSUPP
• DW: DW.PARTSUPP, TIME, Aggregate1 (PKEY, DAY, MIN(COST)) with V1, Aggregate2 (PKEY, MONTH, AVG(COST)) with V2
• Further labels: DW.PARTSUPP.DATE, DAY

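The second flow of the figure chains NotNULL, AddDate and CheckQTY and sends rejected rows to a log. A rough sketch of that pattern follows, assuming dictionary rows; the function `run_flow` and the shape of the log entries are assumptions, only the activity names and their parameters come from the figure.

```python
from datetime import date
from typing import Dict, List, Tuple

Row = Dict[str, object]


def run_flow(rows: List[Row], today: date) -> Tuple[List[Row], List[Tuple[str, Row]]]:
    """Apply NotNULL(COST), AddDate(DATE=SYSDATE) and CheckQTY(QTY>0) in order."""
    accepted: List[Row] = []
    log: List[Tuple[str, Row]] = []  # (rejecting activity, rejected row)
    for row in rows:
        if row.get("COST") is None:
            log.append(("NotNULL", row))
            continue
        row = dict(row, DATE=today)  # AddDate works on a copy of the row
        if not row["QTY"] > 0:
            log.append(("CheckQTY", row))
            continue
        accepted.append(row)
    return accepted, log


flow_rows = [
    {"PKEY": 1, "SUPPKEY": 2, "QTY": 5, "COST": 9.5},
    {"PKEY": 2, "SUPPKEY": 2, "QTY": 3, "COST": None},
    {"PKEY": 3, "SUPPKEY": 2, "QTY": 0, "COST": 1.0},
]
accepted_rows, rejection_log = run_flow(flow_rows, date(2002, 11, 8))
print(len(accepted_rows), [activity for activity, _ in rejection_log])
```

Passing the date in as a parameter instead of reading the system clock inside the function keeps the sketch deterministic.
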
Conceptual Model
[Figure: notation legend with the symbols for concept, attribute, transformation, Note, ETL_constraint, part of, provider 1:1, provider N:M, serial composition, and candidate1 ... candidaten linked to a target under {XOR}, with the active candidate marked]

The lifecycle of a Data Warehouse and its ETL processes
[Figure: lifecycle diagram]
Recoverable elements of the figure:
• Models: Conceptual Model for DW, Sources & Activities; Logical Model for DW, Sources & Activities; Physical Model for DW, Sources & Activities
• Other labels: Reverse Engineering of Sources & Requirements Collection; Logical Design; Tuning - Full Activity Description; Software Construction; Software & SW Metrics; Administration of DW; Metrics

Conceptual Model
[Figure: UML class diagram of the metamodel]
Recoverable elements of the figure:
• Metaclasses: Concept (+name, +schema), Attribute (+name), Transformation (+name, +symbol), ETL_Constraint, Relationship, PartOf, Provider, Serial Composition, Candidate, Active Candidate, Tag (+content)
• Association roles: +attributes, +transformation, +input, +output, +initiating, +consequent, -candidate, -target
• Multiplicities shown: 1 and *

Conceptual Model
• General Notes
  - It is not a process/workflow model
  - It is orthogonal to the conceptual models which are available for the modeling of DW star schemata
  - It is specifically tailored for the back end of the DW
  - Any of the proposals for the DW front end can be combined with our approach

Conceptual Model
• Serial Composition of Transformations
  - a single initiating transformation, a single subsequent transformation
  - combine several transformations in a single provider relationship
[Figure: notation symbol for "serial composition"]

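Serial composition behaves like function composition: the output of the initiating transformation is the input of the subsequent one. A small sketch under that reading follows; `serial_composition`, the parsing step and the exchange rate of 0.5 are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
from functools import reduce
from typing import Callable


def serial_composition(*transformations: Callable) -> Callable:
    """Chain transformations so that each one feeds the next."""

    def composed(value):
        return reduce(lambda acc, step: step(acc), transformations, value)

    return composed


def parse_amount(text: str) -> float:
    return float(text)


def dollars_to_euros(amount: float, rate: float = 0.5) -> float:
    # The rate is a made-up constant, not a real exchange rate.
    return amount * rate


def round_to_cents(amount: float) -> float:
    return round(amount, 2)


# One provider relationship for Cost that applies three transformations in series.
cost_provider = serial_composition(parse_amount, dollars_to_euros, round_to_cents)
print(cost_provider("10.00"))
```
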
