
Journal of King Saud University – Computer and Information Sciences (2011) 23, 91–104

King Saud University


www.ksu.edu.sa
www.sciencedirect.com

ORIGINAL ARTICLE

A proposed model for data warehouse ETL processes


Shaker H. Ali El-Sappagh a,*, Abdeltawab M. Ahmed Hendawi b, Ali Hamed El Bastawissy b

a Mathematics Department, College of Science, King Saud University, Saudi Arabia
b Information Systems Department, Faculty of Computers and Information, Cairo University, Cairo, Egypt

Received 19 September 2010; accepted 23 February 2011


Available online 8 May 2011

KEYWORDS Data warehouse; ETL processes; Database; Data mart; OLAP; Conceptual modeling

Abstract Extraction–transformation–loading (ETL) tools are pieces of software responsible for the extraction of data from several sources, its cleansing, customization, reformatting, integration, and insertion into a data warehouse. Building the ETL process is potentially one of the biggest tasks of building a warehouse; it is complex, time consuming, and consumes most of the data warehouse project's implementation efforts, costs, and resources. Building a data warehouse requires focusing closely on understanding three main areas: the source area, the destination area, and the mapping area (ETL processes). The source area has standard models such as the entity relationship diagram, and the destination area has standard models such as the star schema, but the mapping area has no standard model so far. In spite of the importance of ETL processes, little research has been done in this area due to its complexity. There is a clear lack of a standard model that can be used to represent the ETL scenarios. In this paper we will try to navigate through the efforts done to conceptualize

Abbreviations: ETL, extraction–transformation–loading; DW, data warehouse; DM, data mart; OLAP, on-line analytical processing; DS, data sources; ODS, operational data store; DSA, data staging area; DBMS, database management system; OLTP, on-line transaction processing; CDC, change data capture; SCD, slowly changing dimension; FCME, first-class modeling elements; EMD, entity mapping diagram; DSA, data storage area.
* Corresponding author.
E-mail addresses: Sker@ksu.edu.sa (S.H. Ali El-Sappagh), Abdeltawab_fci@yahoo.com (A.M. Ahmed Hendawi), aelbastawesy@fci-cu.edu.eg (A.H. El Bastawissy).

1319-1578 © 2011 King Saud University. Production and hosting by Elsevier B.V. All rights reserved.

Peer review under responsibility of King Saud University.


doi:10.1016/j.jksuci.2011.05.005




the ETL processes. Research in the field of modeling ETL processes can be categorized into three main approaches: modeling based on mapping expressions and guidelines, modeling based on conceptual constructs, and modeling based on UML environment. These projects try to represent the main mapping activities at the conceptual level. Due to the variation and differences between the proposed solutions for the conceptual design of ETL processes, and due to their limitations, this paper also proposes a model for the conceptual design of ETL processes. The proposed model is built by enhancing the previous models to support some missing mapping features.
© 2011 King Saud University. Production and hosting by Elsevier B.V. All rights reserved.

1. Introduction

A data warehouse (DW) is a collection of technologies aimed at enabling the decision maker to make better and faster decisions. Data warehouses differ from operational databases in that they are subject oriented, integrated, time variant, non volatile, summarized, larger, not normalized, and perform OLAP. The generic data warehouse architecture consists of three layers (data sources, DSA, and primary data warehouse) (Inmon, 2002; Vassiliadis, 2000). Although the area of ETL processes is very important, it has received little research. This is because of its difficulty and the lack of a formal model for representing ETL activities that map the incoming data from different DSs to be in a suitable format for loading to the target DW or DM (Kimball and Caserta, 2004; Demarest, 1997; Oracle Corp., 2001; Inmon, 1997). To build a DW we must run the ETL tool, which has three tasks: (1) data is extracted from different data sources, (2) propagated to the data staging area where it is transformed and cleansed, and then (3) loaded to the data warehouse. ETL tools are a category of specialized tools with the task of dealing with data warehouse homogeneity, cleaning, transforming, and loading problems (Shilakes and Tylman, 1998). This research will try to find a formal representation model for capturing the ETL processes that map the incoming data from different DSs to be in a suitable format for loading to the target DW or DM. Many research projects try to represent the main mapping activities at the conceptual level. Our objective is to propose a conceptual model to be used in modeling various ETL processes and to cover the limitations of the previous research projects. The proposed model will be used to design ETL scenarios, and to document, customize, and simplify the tracing of the mapping between the data source attributes and their corresponding attributes in the data warehouse. The proposed model has the following characteristics:

– Simple: to be understood by the DW designer.
– Complete: to represent all activities of the ETL processes.
– Customizable: to be used in different DW environments.

We call the proposed model entity mapping diagram (EMD). Also, the paper will make a survey of the previous work done in this area. The paper is organized as follows: Section 2 discusses the ETL modeling concepts. The previous work related to ETL processes is discussed in Section 3. We discuss the proposed framework in Section 4. The comparison between the previous models and the proposed one is discussed in Section 5. Next, other related works are shown in Section 6. Finally, Section 7 shows the conclusion and future work.

2. ETL modeling concepts

The general framework for ETL processes is shown in Fig. 1. Data is extracted from different data sources, and then propagated to the DSA where it is transformed and cleansed before being loaded to the data warehouse. Source, staging area, and target environments may have many different data structure formats such as flat files, XML data sets, relational tables, non-relational sources, web log sources, legacy systems, and spreadsheets.

Figure 1 A general framework for ETL processes. [Diagram labels: Data Sources, Extract, DSA, Transform, Load, DW.]

2.1. The ETL phases

During the ETL process, data is extracted from OLTP databases, transformed to match the data warehouse schema, and loaded into the data warehouse database (Berson and Smith, 1997; Moss, 2005). Many data warehouses also incorporate data from non-OLTP systems, such as text files, legacy systems, and spreadsheets. ETL is often a complex combination of process and technology that consumes a significant portion of the data warehouse development efforts and requires the skills of business analysts, database designers, and application developers. The ETL process is not a one-time event. As data sources change, the data warehouse will be periodically updated. Also, as the business changes, the DW system needs to change in order to maintain its value as a tool for decision makers; as a result, the ETL also changes and evolves. The ETL processes must be designed for ease of modification. A solid, well-designed, and documented ETL system is necessary for the success of a data warehouse project.
An ETL system consists of three consecutive functional steps: extraction, transformation, and loading:

2.1.1. Extraction
The first step in any ETL scenario is data extraction. The ETL extraction step is responsible for extracting data from the source systems. Each data source has its distinct set of characteristics that need to be managed in order to effectively extract data for the ETL process. The process needs to effectively integrate systems that have different platforms, such as different

database management systems, different operating systems, and different communications protocols.
While extracting data from different data sources, the ETL team should be aware of (a) using ODBC/JDBC drivers to connect to database sources, (b) understanding the data structure of the sources, and (c) knowing how to handle sources of a different nature, such as mainframes. The extraction process consists of two phases: initial extraction and changed data extraction. The initial extraction (Kimball et al., 1998) is the first time the data is obtained from the different operational sources to be loaded into the data warehouse. This process is done only once, after building the DW, to populate it with a huge amount of data from the source systems. The incremental extraction is called changed data capture (CDC), where the ETL processes refresh the DW with the data modified and added in the source systems since the last extraction. This process is periodic according to the refresh cycle and business needs. It captures only the data changed since the last extraction by using techniques such as audit columns, database log, system date, or the delta technique.

2.1.2. Transformation
The second step in any ETL scenario is data transformation. The transformation step cleans and conforms the incoming data to obtain accurate data which is correct, complete, consistent, and unambiguous. This process includes data cleaning, transformation, and integration. It defines the granularity of fact tables, the dimension tables, the DW schema (star or snowflake), derived facts, slowly changing dimensions, and factless fact tables. All transformation rules and the resulting schemas are described in the metadata repository.

2.1.3. Loading
Loading data to the target multidimensional structure is the final ETL step. In this step, extracted and transformed data is written into the dimensional structures actually accessed by the end users and application systems. The loading step includes both loading dimension tables and loading fact tables.

3. Models of ETL processes

This section will navigate through the efforts done to conceptualize the ETL processes. Although the ETL processes are critical in building and maintaining the DW systems, there is a clear lack of a standard model that can be used to represent the ETL scenarios. After we build our model, we will make a comparison between this model and the models discussed in this section. Research in the field of modeling ETL processes can be categorized into three main approaches:

1. Modeling based on mapping expressions and guidelines.
2. Modeling based on conceptual constructs.
3. Modeling based on UML environment.

In the following, a brief description of each approach is presented.

3.1. Modeling ETL process using mapping expressions

Rifaieh and Benharkat (2002) have defined a model covering different types of mapping expressions. They used this model to create an active ETL tool. In their approach, queries are used to achieve the warehousing process. Queries are used to represent the mapping between the source and the target data, thus allowing the DBMS to play an expanded role as a data transformation engine as well as a data store. This approach enables a complete interaction between mapping metadata and the warehousing tool. In addition, it addresses the efficiency of a query-based data warehousing ETL tool without suggesting any graphical models. It describes a query generator for reusable and more efficient data warehouse (DW) processing.

3.1.1. Mapping guideline
A mapping guideline is the set of information defined by the developers in order to achieve the mapping between the attributes of two schemas. Different kinds of mapping guidelines are used for many applications. Traditionally, these guidelines are defined manually during the system implementation. In the best case, they are saved as paper documents. These guidelines are used as references each time there is a need to understand how an attribute of a target schema has been generated from the source attributes. This method is very weak in the maintenance and evolution of the system. Keeping these guidelines up to date is a very hard task, especially with different versions of guidelines. To update the mapping of an attribute in the system, one should include an update for the paper document guideline as well. Thus, it is extremely difficult to maintain such tasks, especially with simultaneous updates by different users.

3.1.2. Mapping expressions
The mapping expression of an attribute is the information needed to recognize how a target attribute is created from the source attributes. Examples of the applications where mapping expressions are used are listed as follows:

• Schema mapping (Madhavan et al., 2001): for database schema mapping, the mapping expression is needed to define the correspondence between matched elements.
• Data warehousing tool (ETL) (Staudt et al., 1999): includes a transformation process where the correspondence between the source data and the target DW data is defined.
• EDI message mapping: a complex message translation is required for EDI, where data must be transformed from one EDI message format into another.
• EAI (enterprise application integration): the integration of information systems and applications needs a middleware to manage this process (Stonebraker and Hellerstein, 2001). It includes management rules of an enterprise's applications, data spread rules for concerned applications, and data conversion rules. Indeed, data conversion rules define the mapping expression of integrated data.

3.1.3. Mapping expression examples
Some examples of the mapping expressions identified from different types of applications are shown as follows:

• Break-down/concatenation: in this example the value of a field is established by breaking down the value of a source and by concatenating it with another value, as shown in Fig. 2.
• Conditional mapping: sometimes the value of a target attribute depends on the value of another attribute. In the example, if X = 1 then Y = A else Y = B, as shown in Fig. 3.

More about mapping expression rules and notation can be found in Jarke et al. (2003) and Miller et al. (2000).
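
The two mapping expression examples of Section 3.1.3 can be sketched in a few lines of Python. This is only an illustrative sketch and not the notation of Jarke et al. (2003): the function names, the field layouts ("1234 - XYZ" and "DD/MM/YY"), and the chosen target format are assumptions made here for the example.

```python
def break_down_concat(code_field: str, date_field: str) -> str:
    """Break-down/concatenation: split source values and join selected parts.

    Hypothetical layout (an assumption for this sketch): code_field looks
    like "1234 - XYZ" and date_field looks like "DD/MM/YY".
    """
    # Break down the first source value into its number and label parts.
    _number, _sep, label = code_field.partition(" - ")
    # Break down the second source value into day, month, and year.
    day, month, year = date_field.split("/")
    # Concatenate the label with the reordered date parts to form the target.
    return f"{label}-{year}{month}{day}"


def conditional_mapping(x: int, a: str, b: str) -> str:
    """Conditional mapping: if X = 1 then Y = A else Y = B."""
    return a if x == 1 else b


if __name__ == "__main__":
    print(break_down_concat("1234 - XYZ", "25/12/10"))
    print(conditional_mapping(1, "A", "B"))
```

Keeping each mapping expression as a small named function is one way to avoid the paper-document guidelines criticized in Section 3.1.1, since the rule that produces a target attribute is then stored next to the data it transforms.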

Figure 2 Example 1: Break-down/concatenation (Jarke et al., 2003).

Figure 3 Example 2: Conditional mapping (Jarke et al., 2003).

3.2. Modeling ETL processes using conceptual constructs

In Vassiliadis et al. (2002a, 2003, 2005) the authors attempt to provide a first model towards the conceptual modeling of the data warehousing ETL processes. They introduce a framework for the modeling of ETL activities. Their framework contains three layers, as shown in Fig. 4.

Figure 4 The metamodel for the logical entities of the ETL environment (Vassiliadis et al., 2003). [Diagram labels: Metamodel Layer (Data types, Functions, Elementary Activity, RecordSet, Relationships); IsA; Template Layer (NotNull, Domain Mismatch, SK Assignment, Source Table, Fact Table, Provider Rel); InstanceOf; Schema Layer (S1.PW, NN1, DM1, SK1, DW.PS).]

The lower layer, namely the schema layer, involves a specific ETL scenario. All the entities of the schema layer are instances of the classes data type, function type, elementary activity, recordset and relationship.
The higher layer, namely the metamodel layer, involves the aforementioned classes. The linkage between the metamodel and the schema layers is achieved through instantiation ("instanceOf") relationships. The metamodel layer implements the aforementioned generality: the five classes which are involved in the metamodel layer are generic enough to model any ETL scenario, through the appropriate instantiation.
The middle layer is the template layer. The constructs in the template layer are also meta-classes, but they are quite customized for the regular cases of ETL processes. Thus, the classes of the template layer represent specializations (i.e., subclasses) of the generic classes of the metamodel layer (depicted as "IsA" relationships). After defining the previous framework, the authors present the graphical notation and the metamodel of their proposed graphical model as shown in Fig. 5.

Figure 5 Notations for the conceptual modeling of ETL activities (Vassiliadis et al., 2002a). [Diagram labels: Concept, Attribute, Transformation, ETL_Constraint, Note, Part Of, Provider 1:1, Provider N:M, Serial Composition, Active Candidate, Candidate.]

Then, they detail and formally define all the entities of the metamodel:

– Data types. Each data type T is characterized by a name and a domain which is a countable set of values. The values of the domains are also referred to as constants.
– Recordsets. A recordset is characterized by its name, its logical schema (structure of the recordset) and its physical extension (i.e., a finite set of records under the recordset schema) which is the actual record values. Any data structure can be treated as a "record set" provided that there are the means to logically restructure it into a flat, typed record schema. The two most popular types of recordsets are relational tables and record files.
– Functions. A function type comprises a name, a finite list of parameter data types, and a single return data type. A function is an instance of a function type.
– Elementary activities. Activities are logical abstractions representing parts, or full modules of code. An abstraction of the source code of an activity is employed, in the form of an LDL (logic-programming, declarative language) statement, in order to avoid dealing with the peculiarities of a particular programming language (Naqvi and Tsur, 1989).

– Relationships. Depict the flow of data from the sources to the target.

Then the authors use their graphical model to represent ETL processes in a motivating example. As shown in Fig. 6, two data sources (S1.partsupp and S2.partsupp) are used to build the target data warehouse (DW.partsupp). The conceptual model of Vassiliadis et al. (2002a) is complemented in Vassiliadis et al. (2002b, 2003) and Simitsis (2003) with the logical design of ETL processes as data-centric workflows. In Vassiliadis et al. (2003) the authors describe a framework for the declarative specification of ETL scenarios. They discuss the implementation issues and they present a graphical tool 'ARKTOS II' that facilitates the design of ETL scenarios, based on their model. In Vassiliadis et al. (2002b) the authors model an ETL scenario as a graph which they call the architectural graph and they introduce some notations for this graph. They introduce importance metrics to measure the degree to which entities are bound to each other. In Simitsis (2003) the author focuses on the optimization of the ETL processes, in order to minimize the execution time of an ETL process. Regarding data mapping, in Dobre et al. (2003) the authors discuss issues related to data mapping in the integration of data; a set of mapping operators is introduced and a classification of possible mapping cases is presented, as shown in Fig. 7. However, no graphical representation of data mapping scenarios is provided; hence, it is difficult to use in real world projects. In Bernstein and Rahm (2000) a framework for mapping between models (objects) is proposed.
Models are manipulated by a set of high-level operations including:

• Match: create a mapping between two models.
• Apply Function: apply a given function to all objects in a model.
• Union, Intersection, Difference: applied to a set of objects.
• Delete: delete all objects in a model.
• Insert, Update: applied to individual objects in models.

Figure 6 Motivating example for conceptual model in Vassiliadis et al. (2002a).

Figure 7 Sample mapping operators.

3.3. Modeling based on UML environment

In Lujan-Mora et al. (2004) the authors introduce their model that is based on the UML (unified modeling language) notations. It is known that UML does not contain a direct relationship between attributes in different classes; the relationship is established between the classes themselves, so the authors extend UML to model attributes as first-class citizens. In their attempt to provide complementary views of the design artifacts in different levels of detail, the framework is based on a principled approach in the usage of UML packages, to allow zooming in and out of the design of a scenario.

3.3.1. Framework
The architecture of a data warehouse is usually depicted as various layers of data in which data from one layer is derived from the data of the previous layer (Lujan-Mora and Trujillo, 2003). Following this consideration, the development of a DW can be structured into an integrated framework with five stages



and three levels that define different diagrams for the DW model, as explained below:

– Phases: there are five stages in the definition of a DW:
• Source: it defines the data sources of the DW, such as OLTP systems and external data sources.
• Integration: it defines the mapping between the data sources and the data warehouse.
• Data warehouse: it defines the structure of the data warehouse.
• Customization: it defines the mapping between the data warehouse and the clients' structures.
• Client: it defines special structures that are used by the clients to access the data warehouse, such as data marts or OLAP applications.
– Levels: each stage can be analyzed at three levels or perspectives:
• Conceptual: it defines the data warehouse from a conceptual point of view.
• Logical: it addresses logical aspects of the DW design, such as the definition of the ETL processes.
• Physical: it defines physical aspects of the DW, such as the storage of the logical structures in different disks, or the configuration of the database servers that support the DW.

3.3.2. Attributes as first-class modeling elements (FCME)
Both in the ERD model and in UML, attributes are embedded in the definition of their comprising "element" (an entity in the ER or a class in UML), and it is not possible to create a relationship between two attributes. In order to allow attributes to play the same role in certain cases, the authors propose the representation of attributes as FCME in UML. In a UML class diagram, two kinds of modeling elements are treated as FCME. Classes, as abstract representations of real-world entities, are naturally found in the center of the modeling effort. Being FCME, classes act as attribute containers. The relationships between classes are captured by associations. Associations can also be FCME, called association classes. An association class can contain attributes or can be connected to other classes. However, the same is not possible with attributes. They refer to the class that contains the attributes as the container class and the class that represents an attribute as the attribute class. The authors formally define attribute/class diagrams, along with the new stereotypes, «Attribute» and «Contain», defined as follows:
Attribute classes are materializations of the «Attribute» stereotype, introduced specifically for representing the attributes of a class. The following constraints apply for the correct definition of an attribute class as a materialization of an «Attribute» stereotype:

– Naming convention: the name of the attribute class is the name of the corresponding container class, followed by a dot and the name of the attribute.
– Features: an attribute class can contain neither attributes nor methods.

A contain relationship is a composite aggregation between a container class and its corresponding attribute classes, originated at the end near the container class and highlighted with the «Contain» stereotype.
An attribute/class diagram is a regular UML class diagram extended with «Attribute» classes and «Contain» relationships.
In the data warehouse context, the relationship involves three logical parties: (a) the provider entity (schema, table, or attribute), responsible for generating the data to be further propagated, (b) the consumer, that receives the data from the provider, and (c) their intermediate matching that involves the way the mapping is done, along with any transformation and filtering. Since a mapping diagram can be very complex, this approach offers the possibility to organize it in different levels thanks to the use of UML packages.

Figure 8 Data mapping levels (Lujan-Mora et al., 2004).

Their layered proposal consists of four levels as shown in Fig. 8:

– Database level (or level 0). In this level, each schema of the DW environment (e.g., data sources at the conceptual level in the SCS 'source conceptual schema', conceptual schema of the DW in the DWCS 'data warehouse conceptual schema', etc.) is represented as a package (Lujan-Mora and Trujillo, 2003; Trujillo and Lujan-Mora, 2003). The mappings among the different schemata are modeled in a single mapping package, encapsulating all the lower-level mappings among different schemata.



– Dataflow level (or level 1). This level describes the data relationship among the individual source tables of the involved schemata towards the respective targets in the DW. Practically, a mapping diagram at the database level is zoomed into a set of more detailed mapping diagrams, each capturing how a target table is related to source tables in terms of data.
– Table level (or level 2). Whereas the mapping diagram of the dataflow level describes the data relationships among sources and targets using a single package, the mapping diagram at the table level details all the intermediate transformations and checks that take place during this flow. Practically, if a mapping is simple, a single package that represents the mapping can be used at this level; otherwise, a set of packages is used to segment complex data mappings in sequential steps.
– Attribute level (or level 3). In this level, the mapping diagram involves the capturing of inter-attribute mappings. Practically, this means that the diagram of the table is zoomed in and the mapping of provider to consumer attributes is traced, along with any intermediate transformation and cleaning.

At the leftmost part of Fig. 8, a simple relationship among the DWCS and the SCS exists: this is captured by a single data mapping package and these three design elements constitute the data mapping diagram of the database level (or level 0). Assuming that there are three particular tables in the DW that we would like to populate, this particular data mapping package abstracts the fact that there are three main scenarios for the population of the DW, one for each of these tables. In the dataflow level (or level 1) of the framework, the data relationships among the sources and the targets in the context of each of the scenarios are practically modeled by the respective package. If we zoom in on one of these scenarios, e.g., mapping 1, we can observe its particularities in terms of data transformation and cleaning: the data of source 1 are transformed in two steps (i.e., they have undergone two different transformations), as shown in Fig. 8. Observe also that there is an intermediate data store employed to hold the output of the first transformation (Step 1) before it is passed on to the second one (Step 2). Finally, at the right lower part of Fig. 8, the way the attributes are mapped to each other for the data stores source 1 and intermediate is depicted. Let us point out that in case we are modeling a complex and huge data warehouse, the attribute transformation modeled at level 3 is hidden within a package definition.

4. The proposed ETL processes model (EMD)

To conceptualize the ETL processes used to map data from sources to the target data warehouse schema, we studied the previous research projects, made some integration, and added some extensions to the approaches mentioned above. We propose the entity mapping diagram (EMD) as a new conceptual model for modeling ETL process scenarios. Our proposed model mainly follows the approach of modeling based on conceptual constructs. The proposed model will fulfill six requirements (El Bastawesy et al., 2005; Maier, 2004; Arya et al., 2006):

1. Supports the integration of multiple data sources.
2. Is robust in view of changing data sources.
3. Supports flexible transformations.
4. Can be easily deployed in a suitable implementation environment.
5. Is complete enough to handle the various extraction, transformation, and loading operations.
6. Is simple in creating and maintaining.

In this section, we will describe the EMD framework, the EMD metamodel, and the primitives of EMD constructs, and finally we will provide a demonstration example. A comparison and evaluation of the previous approaches against our proposed model will be presented in Section 5.

4.1. EMD framework

Fig. 9 shows the general framework of the proposed entity mapping diagram.

Figure 9 A general framework of EMD.

– In the data source(s) part: the participating data sources are drawn. The data sources may be structured databases or non-structured sources. In the case of structured sources, the participating databases and their participating tables and attributes are used directly as the base source; in the case of non-structured sources, a conversion step should be applied first to convert the non-structured source into a structured one (tables and their attributes). From the design view, there is one conversion construct that can convert any non-structured source into a structured (relational) database, but from the implementation view, each type of non-structured source will have its own conversion module, which is called a wrapper. Wrappers are specialized program routines that automatically extract data from different data sources with



different formats and convert the information into a structured format. The typical tasks of a
wrapper are: (a) fetching data from a remote resource, (b) searching for, recognizing and
extracting specified data, and (c) saving this data in a suitable structured format to enable
further manipulation (Vassiliadis et al., 2005).
– Extraction: during the extraction process some temporary tables may be created to hold the
result of converting non-structured sources into databases. The extraction process includes
initial extraction and refresh. The initial extraction takes place when the ETL scenario is
executed for the first time, while there is no data in the destination data warehouse. The
refresh extraction takes place to capture the delta data (the difference between the old data in
the DW and the updated data in the data sources). It is preferred to separate the ETL scenario
with initial extraction from the ETL scenario with refresh extraction. This means that the user
may need to build two EMD models for the same ETL scenario: one for the initial extraction, and
the other for the refresh extraction using the old data in the temp tables found in the staging
area.
– In the DW schema part: the data warehouse schema table (fact or dimension) is drawn. Although
the fact table and the dimension table clearly differ in their functionalities and features,
both are data containers. Basically, the data warehouse is stored as a relational structure, not
as a multidimensional structure. The multidimensionality occurs in the online analytical
processing (OLAP) engines.
– In the mapping part: the required transformation functions are drawn. The transformation
operations take place on the incoming data from the base source and/or the temporary source in
the staging area. Some transformation operations lead to temporary results which are saved in
temporary tables in the staging area.
– The staging area: a physical container that contains all temporary tables created during the
extraction process or resulting from the applied transformation functions.
– Loading: as the data reaches the final appropriate format, it is loaded to the corresponding
data element in the destination DW schema. The data may be loaded directly as a result of a
certain transformation function or captured from the desired temporary tables in the staging
area.

Notice that both data sources and data warehouse schemas should be defined clearly before
starting to draw EMD. Also, the arrows’ directions show that first the data sources are drawn,
after that a set of transformations is applied, and then the data are loaded to the destination
data warehouse schema.

4.2. EMD metamodel

EMD is a proposed conceptual model for modeling the ETL processes which are needed to map data
from sources to the target data warehouse schema. Fig. 10 shows the metamodel architecture for
the proposed conceptual model EMD. The metamodel of the proposed EMD is composed of two layers;
the first layer is the abstraction layer, in which five objects (function, data container,
entity, relationship, and attribute) are clearly defined. The objects in the abstraction layer
are a high level view of the parts or objects that can be used to draw an EMD scenario.
The second layer is the template layer, which is an expansion to the abstraction layer.
The link between the abstraction layer and the template layer may be considered as an
aggregation relationship. A function may be an attribute transformation, an entity
transformation, a UDF (user defined function), or convert into structure (relation). Fig. 11
shows the types of transformation functions that can be applied to sources in the proposed EMD.
An entity transformation is a function that can be applied to a source table (e.g. duplicate
elimination, union, etc.). An attribute transformation function can be applied to a source
attribute (e.g. to upper case, to String, etc.). A user defined function (UDF) is any function
that may be added by the user who is the creator of the ETL scenario (e.g. unification between
different types of units). Convert into structure is a function that can be applied to the
non-structured (semi-structured and unstructured) data sources so that it can be converted into
structured source to enable the other transformation functions

Figure 10 EMD metamodel.



Figure 11 Types of transformations in EMD.

to be applied on it. A data container may be a source database, a target data warehouse or data
mart, or non-structured source. An entity may be a source table, a dimension table, or a fact
table. A relationship may be an extractor or a loader. The extractor expresses the data
extraction process from the source and the loader expresses the data loading process to the
final destination. The attribute may be a table column or a non-structured file field.
The metamodel can be expanded to include any extra objects that may be required in the future.
The user can use instances of the template layer to create his model to build the desired ETL
scenario. It should be mentioned here that the user of EMD is the data warehouse designer or the
ETL designer; this means that some primitive rules, limitations, and constraints are kept in his
mind during the usage of different parts of EMD, e.g., the union operation will be applied
successfully when the participating tables have the same number of attributes with the same data
types for the corresponding attributes.

4.3. Primitives of EMD constructs

The basic set of constructs that is used in the proposed entity mapping diagram is shown in
Fig. 12. In this section, some explanation about the usage of the constructs of the proposed
entity mapping diagram will be given, as follows:

– Loader relationship: is used when the data are moved directly from the last source element
(the actual source or the temporary one) to the target data element. The actual source is the
base source from which the data are extracted; on the other hand, the temporary source is the
one that results from the transformation operations.
– Optional loader relationship: is used to show that the data loaded to the output attribute
could be extracted from candidate source element x or candidate source element y.
– Convert into structure: represents the conversion operations required to restructure the
non-structured base source into structured one (relations as tables and attributes). The

Figure 12 Graphical constructs for the proposed EMD.



conversion operation saves its result into temporary tables, so the transformation operation can
be applied to the new temporary source.
– Entity transformation operation: this kind of transformation usually results in a temporary
entity. There are standard operators that are used inside this construct; Fig. 11(a) shows some
of these operators.
– Attribute transformation operation: standard operations are used with this construct;
Fig. 11(b) shows a sample of these operators.
– User defined function (UDF) as a transformation operation: the user can use his own defined
operations, so any kind of transformation can be added, such as currency conversion functions,
packages (units) conversions, and so on, as shown in Fig. 11(c).
– Non-structured source: represents any source that is not in the relational structure. The
non-structured source may be a semi-structured or unstructured source such as XML files, web
logs, Excel workbooks, object-oriented databases, etc., as shown in Fig. 11(d).

Notice that a symbol or shorthand of the operation is put inside the entity or the attribute
transformation construct. The transformation functions that take place in the proposed model EMD
are classified into built-in or standard functions, such as join, union, and rename, and user
defined functions as mentioned above, like any formula defined by the user. Another
classification for the transformation functions, according to the level of transformation, is
entity transformation functions and attribute transformation functions.
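
The four kinds of transformation function named in the metamodel can be illustrated with a minimal Python sketch. It is not part of the proposed EMD Builder; tables are modeled as lists of dictionaries, and every name here (`convert_into_structure`, `duplicate_elimination`, `union`, `apply_to_attribute`, `pounds_to_kilograms`, and the sample data) is a hypothetical choice made for illustration. The union keeps duplicates (an assumed UNION ALL reading) and enforces the compatibility rule stated in Section 4.2.

```python
import csv
import io


def convert_into_structure(raw_text):
    """Convert into structure: wrap a non-structured source (here CSV text) as rows."""
    return [dict(record) for record in csv.DictReader(io.StringIO(raw_text))]


def duplicate_elimination(rows):
    """Entity transformation: drop repeated records, keeping the first occurrence."""
    seen, result = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result


def union(left, right):
    """Entity transformation: concatenate two union-compatible tables."""
    def signature(rows):
        # Attribute names paired with value types, as a stand-in for a schema.
        return {(name, type(value)) for row in rows for name, value in row.items()}

    if left and right and signature(left) != signature(right):
        raise ValueError("tables do not share attribute names and data types")
    return left + right


def apply_to_attribute(rows, attribute, func):
    """Attribute transformation: apply func to one attribute of every row."""
    return [{**row, attribute: func(row[attribute])} for row in rows]


def pounds_to_kilograms(value):
    """Hypothetical UDF: unify units by converting pounds to kilograms."""
    return round(float(value) * 0.45359237, 3)


# A tiny non-structured source with one duplicated record.
raw = "id,name,weight\n1,widget,10\n1,widget,10\n2,gadget,4\n"
temp = convert_into_structure(raw)                      # wrapper step
temp = duplicate_elimination(temp)                      # entity transformation
temp = apply_to_attribute(temp, "name", str.upper)      # attribute transformation
temp = apply_to_attribute(temp, "weight", pounds_to_kilograms)  # UDF
```

Each step returns a new list, which mirrors the way EMD saves intermediate results in temporary entities rather than changing the base source.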

4.4. Demonstration example

To illustrate the usage of our proposed graphical model, we introduce a simple example. A company
wants to build a data warehouse for monitoring the sales processes in its two branches. It has a
relational data source described by schema DS1 for selling books, shown in Fig. 13, and another
relational data source described by schema DS2 for selling general products, shown in Fig. 14. A
relational data warehouse is designed to capture sales data from the two predefined data sources.
The star schema in Fig. 15 shows the design of the proposed data warehouse, which consists of one
fact table and four dimension tables.
Fig. 16 depicts the entity mapping diagram for building the
products dimension from the desired data sources, passing
through the required ETL activities. The explanation of this
diagram is as follows:

– DS1: refers to the first data source (books-orders database).

Figure 13 Relational schema DS1 for books-orders database.

Figure 14 Relational schema DS2 for products-orders database.



Figure 15 Star schema for the proposed data warehouse.

Figure 16 EMD scenario for building products dimension.

– DS2: refers to the second data source (products-orders database).

There are two entities from each data source that participate in this diagram: Book (BookID,
BookTitle, CategoryID) and Category (CategoryID, CategoryName) from the first data source, and
Products (ProductID, ProductName, BrandID) and Brands (BrandID, CategoryName) from the second
data source.
DW1: refers to the data warehouse schema to which the data will be moved; we may have one or
more DW schemas, one or more data mart (DM) schemas, or a combination of DW and DM. Dim_Products
is a dimension entity found in DW1. In the middle of the diagram, mapping processes are
represented using a set of transformation steps, starting with a join operation between the Book
and Category tables, then removing the redundant records by applying the duplicate elimination
operation.
Temporary entity (Temp1) is created to capture the intermediate data that result from the
previous operations. Notice that data of attribute Temp1.CategoryID could be loaded optionally
from DS1.Book.CategoryID or DS1.Category.CategoryID. The same activities take place in the other
site that contains DS2, resulting in the Temp2 table.
After that, some attribute transformation operations take place before loading data to the
target data warehouse; some of them are described as follows: (++) is a user defined
transformation operation applied to Temp1.ProductID to add

10,00,000 to each product code number as a user requirement. ProductID and CategoryID data types
are transformed to string data type by using the ToString (TS) operation. The Temp2 table is
transferred to the site of DS1 using a file transfer protocol (FTP) operation, then a union
operation (U) runs to combine the two tables. The loader relationships connected to the
ProductName and CategoryName attributes mean that data is loaded from these two attributes to
their corresponding attributes in the DW without any transformation.
We can now develop a prototype tool (named EMD Builder) to achieve the following tasks:

– Introducing a tool for drawing the entity mapping diagram scenarios using a palette of
graphical controls.
– Implementing a set of transformation operations.
– Transforming the graphical model to code by generating SQL script.
– Generating the mapping document according to Kimball’s standards (Kimball and Caserta, 2004).
– Executing the EMD scenario on the data sources to apply the extraction and transformation
operations, then loading data to the target DW schema.
– The code may be written in the C# or Java object-oriented programming languages, with a
relational database management system such as Oracle or Microsoft SQL Server.

We propose the architecture in Fig. 17 for the model, and in the future work we will implement
and test this model.
The first module checks the connection to the database management system installed on the
machine on which the source databases exist. If the connection fails, an error message will
appear to alert the user and the application will halt. If the connection succeeds, a new
database “ETL” will be created. “ETL” plays the role of a repository in which the metadata about
the EMD scenarios will be stored. The metadata in the repository will be used to generate the
mapping document. After creating the “ETL” database, the user may either create a new EMD
scenario or open an existing one to complete it. In case of creating a new scenario, a new
building area will appear to enable the user to draw and build his new model. In case of opening
an existing EMD scenario, two files will be read: the first one is the “.etl” file, from which
the old scenario will be loaded to the drawing area to enable the user to complete it, and the
second is the “.sql” file, in which the SQL script of the old part of the existing scenario was
written and which will be completed as the user completes his model. The next module loads both
the metadata about the databases found on the database management system and the “EMD Builder”
interface icons. The metadata includes the database names, tables, attributes, and so on. The
interface icons will be loaded from our icon gallery; the interface elements will be shown in
the next sections. The next module facilitates the drawing process, by which the user can use
our palette of controls to draw and build his EMD scenario. By using the execution module, the
EMD model will be translated into SQL script and then executed on the incoming data from the
source databases, so the extraction, transformation, and loading processes can be applied and
the desired records will be transferred to the target DW schema in the required format. The last
module is responsible for saving the user’s EMD model. During the save operation, three files
are generated: the first one contains the user’s EMD model in a binary format, so the user can
open it at any time to update its drawing; the second contains the generated SQL script; and the
third is the mapping document, which is considered as a dictionary and catalog for the ETL
operations found in the user’s EMD scenario. The user can specify the folder in which the
generated files will be saved. The generated files can be transferred from one machine to be
used on another one that contains the same data sources and the same target data warehouse
schema; this means that the generated files from our tool are machine independent; however, they
are data source and destination schema dependent. It is clear that the destination is a whole
schema (data warehouse or data mart), but each part of this schema (fact or dimension) is
handled as a standalone destination in a single EMD scenario.
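
The products-dimension scenario of Fig. 16 could be translated into SQL roughly as in the following sketch, which runs the script on an in-memory SQLite database from Python. This is an illustration under several assumptions, not the output of EMD Builder: the columns of Dim_Products are hypothetical (Fig. 15 defines the real ones), the offset printed as 10,00,000 is read as one million, both sources live in one database so the FTP transfer step is omitted, and the sample rows are invented.

```python
import sqlite3

# Assumption: the offset printed as "10,00,000" in the text is read as one million.
ID_OFFSET = 1_000_000

SOURCE_SCHEMA = """
CREATE TABLE Book (BookID INTEGER, BookTitle TEXT, CategoryID INTEGER);
CREATE TABLE Category (CategoryID INTEGER, CategoryName TEXT);
CREATE TABLE Products (ProductID INTEGER, ProductName TEXT, BrandID INTEGER);
CREATE TABLE Brands (BrandID INTEGER, CategoryName TEXT);
-- Hypothetical target columns; the real dimension is defined in Fig. 15.
CREATE TABLE Dim_Products (ProductID TEXT, ProductName TEXT, CategoryName TEXT);
"""

ETL_SCRIPT = f"""
-- DS1: join Book with Category, eliminate duplicates, keep the result in Temp1.
CREATE TABLE Temp1 AS
SELECT DISTINCT b.BookID AS ProductID, b.BookTitle AS ProductName,
       c.CategoryName AS CategoryName
FROM Book AS b JOIN Category AS c ON b.CategoryID = c.CategoryID;

-- DS2: the same activities, kept in Temp2.
CREATE TABLE Temp2 AS
SELECT DISTINCT p.ProductID AS ProductID, p.ProductName AS ProductName,
       br.CategoryName AS CategoryName
FROM Products AS p JOIN Brands AS br ON p.BrandID = br.BrandID;

-- (++) offset on Temp1.ProductID, ToString (TS), union (U), then load.
INSERT INTO Dim_Products (ProductID, ProductName, CategoryName)
SELECT CAST(ProductID + {ID_OFFSET} AS TEXT), ProductName, CategoryName FROM Temp1
UNION
SELECT CAST(ProductID AS TEXT), ProductName, CategoryName FROM Temp2;
"""


def build_dim_products(conn):
    """Run the generated script and return the loaded dimension rows."""
    conn.executescript(ETL_SCRIPT)
    query = ("SELECT ProductID, ProductName, CategoryName "
             "FROM Dim_Products ORDER BY ProductName")
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SOURCE_SCHEMA)
# Invented sample rows; the book is inserted twice to exercise duplicate elimination.
conn.executemany("INSERT INTO Book VALUES (?, ?, ?)",
                 [(1, "SQL Basics", 10), (1, "SQL Basics", 10)])
conn.execute("INSERT INTO Category VALUES (10, 'Books')")
conn.execute("INSERT INTO Products VALUES (1, 'Laptop', 7)")
conn.execute("INSERT INTO Brands VALUES (7, 'Electronics')")
rows = build_dim_products(conn)
```

Keeping the script as a plain string matches the idea of saving the generated SQL next to the model, so the same script can be rerun on another machine that has the same source and target schemas.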

5. Models evaluation and comparison

Table 1 contains the matrix that is used to compare the different ETL modeling approaches and to
evaluate our proposed model against the other models. The letter P in the matrix means that the
model partially supports the corresponding criterion.
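
The totals in the Evaluation row of Table 1 follow from the scoring rule in its footnote (Yes = 1; No = 0; P = 0.5). As a check, the following short Python sketch encodes each model's column as a 14-character string, in the row order of the table, and sums the weights. The encoding and the names `WEIGHTS`, `MARKS`, and `score` are choices made here for illustration.

```python
# Scoring rule from the footnote of Table 1.
WEIGHTS = {"Y": 1.0, "N": 0.0, "P": 0.5}

# One character per criterion, in the row order of Table 1
# (ten design aspects followed by four implementation aspects).
MARKS = {
    "Mapping expressions":   "NNYYYNYNNYYYNN",
    "Conceptual constructs": "YYPYYNYNNYYNNN",
    "UML environment":       "YNNYYNYNNNNNNN",
    "EMD":                   "YYYYYYYYYYYYYN",
}


def score(marks):
    """Sum the weights of one model's marks."""
    return sum(WEIGHTS[mark] for mark in marks)


totals = {model: score(marks) for model, marks in MARKS.items()}
for model, total in totals.items():
    print(f"{model}: {total:g} of {len(MARKS[model])}")
```

The computed totals (7, 7.5, 4, and 13 out of 14) agree with the Evaluation row of the table.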

Figure 17 Basic modules of “EMD Builder”.

6. Other related work

The ETL process, in data warehouse, is a hot point of research because of its importance and cost
in data warehouse project building and maintenance. The method of systematic review is used in
Muñoz et al. (2010a) to identify, extract and analyze the main proposals on modeling conceptual
ETL processes for DWs. Generating ETL processes for incremental loading is addressed in (Jörg and

Table 1 Comparison and evaluation matrix.

Criteria                               Model
                                       Mapping        Conceptual    UML            EMD
                                       expressions    constructs    environment
Design aspects
  Complete graphical model             No             Yes           Yes            Yes
  New constructs                       No             Yes           No             Yes
  (OO) concept independent             Yes            P             No             Yes
  DBMS independent                     Yes            Yes           Yes            Yes
  Mapping operations                   Yes            Yes           Yes            Yes
  User defined transformation          No             No            No             Yes
  Mapping relationship                 Yes            Yes           Yes            Yes
  Source independent (non-relational)  No             No            No             Yes
  Source converting                    No             No            No             Yes
  Flat model                           Yes            Yes           No             Yes
Implementation aspects
  Develop a tool                       Yes            Yes           No             Yes
  Generate SQL                         Yes            No            No             Yes
  Generate mapping document            No             No            No             Yes
  Non-relational handling              No             No            No             No
Evaluation                             7              7.5           4              13

Yes = 1; No = 0; P: partial = 0.5; total = 14.

Deßloch, 2008). A simulation model for secure data extraction in ETL processes is presented in
Mrunalini et al. (2009). A set of measures with which to evaluate the structural complexity of
ETL process models at the conceptual level is discussed in Muñoz et al. (2010b). In Simitsis and
Vassiliadis (2008) the authors discuss the mapping of the conceptual model to the logical model.
Generating an incremental ETL process automatically from the full ETL process is discussed in
Zhang et al. (2008). In Simitsis et al. (2008) the authors discuss the application of natural
language generation techniques to the ETL environment. Measures for ETL process models in data
warehouses are discussed in Muñoz et al. (2009).

7. Conclusion and future work

ETL processes are a very important problem in the current research of data warehousing. In this
paper, we have investigated this problem, which represents a real need to find a standard
conceptual model for representing, in a simplified way, the extraction, transformation, and
loading (ETL) processes. Some approaches have been introduced to handle this problem. We have
classified these approaches into three categories: first, modeling based on mapping expressions
and guidelines; second, modeling based on conceptual constructs; and finally, modeling based on
the UML environment. We have explained each model in some detail.
What is more, we proposed a novel conceptual model, the entity mapping diagram (EMD), as a
simplified model for representing extraction, transformation, and loading processes in data
warehousing projects. In order to explain our proposed model, we defined a metamodel for the
entity mapping diagram. In the metamodel we defined two layers; the first is the abstraction
layer, in which five objects (function, data container, entity, relationship, and attribute) are
clearly defined. The objects in the abstraction layer are a high level view of the parts or
objects that can be used to draw an EMD scenario. The second is the template layer, which is an
expansion to the abstraction layer. The user can add his own layer in which the ETL designer
draws his EMD scenario. We also set a framework for using this model. The framework consists of
a data sources part, a data warehouse schema part, and a mapping part. Both data sources and
data warehouse schemas should be defined clearly before starting to draw an EMD scenario. By
comparing the proposed model to the previous research projects using the evaluation matrix, the
proposed model handles many weak points that appear in the previous work. In the future work to
this paper, we will develop and test a prototype tool, called ‘EMD Builder’, to achieve the
following tasks: introducing a tool for drawing the entity mapping diagram scenarios using a
palette of graphical constructs, implementing a set of transformation operations, transforming
the graphical model to code by generating SQL script, and generating the mapping document
according to Kimball’s standards.

References

Arya, P., Slany, W., Schindler, C., 2006. Enhancing Wrapper Usability through Ontology Sharing and Large Scale Cooperation. <www.ru5.cti.gr/HT05/files/andreas_rath.ppt> (accessed 2006).
Bernstein, P., Rahm, E., 2000. Data warehouse scenarios for model management. In: Proceedings of the 19th International Conference on Conceptual Modeling (ER’00), LNCS, vol. 1920, Salt Lake City, USA, pp. 1–15.
Berson, A., Smith, S.J., 1997. Data Warehousing, Data Mining, and OLAP. McGraw-Hill.
Demarest, M., 1997. The Politics of Data Warehousing. <http://www.hevanet.com/demarest/marc/dwpol.html>.
Dobre, A., Hakimpour, F., Dittrich, K.R., 2003. Operators and classification for data mapping in semantic integration. In: Proceedings of the 22nd International Conference on Conceptual Modeling (ER’03), LNCS, vol. 2813, Chicago, USA, pp. 534–547.
El Bastawesy, A., Boshra, M., Hendawi, A., 2005. Entity mapping diagram for modeling ETL processes. In: Proceedings of the Third International Conference on Informatics and Systems (INFOS), Cairo.

Inmon, B., 1997. The Data Warehouse Budget. DM Review Magazine, January 1997. <www.dmreview.com/master.cfm?NavID=55&EdID=1315>.
Inmon, W.H., 2002. Building the Data Warehouse, third ed. John Wiley and Sons, USA.
Jarke, M., Lenzerini, M., Vassiliou, Y., Vassiliadis, P., 2003. Fundamentals of Data Warehouses, second ed. Springer-Verlag.
Jörg, Thomas, Deßloch, Stefan, 2008. Towards generating ETL processes for incremental loading. In: ACM Proceedings of the 2008 International Symposium on Database Engineering and Applications.
Kimball, R., Caserta, J., 2004. The Data Warehouse ETL Toolkit. Practical Techniques for Extracting, Cleaning, Conforming and Delivering Data. Wiley.
Kimball, R., Reeves, L., Ross, M., Thornthwaite, W., 1998. The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing, and Deploying Data Warehouses. John Wiley and Sons.
Lujan-Mora, S., Trujillo, J., 2003. A comprehensive method for data warehouse design. In: Proceedings of the Fifth International Workshop on Design and Management of Data Warehouses (DMDW’03), Berlin, Germany.
Lujan-Mora, S., Vassiliadis, P., Trujillo, J., 2004. Data mapping diagram for data warehouse design with UML. In: International Conference on Conceptual Modeling, Shanghai, China, November 2004.
Madhavan, J., Bernstein, P.A., Rahm, E., 2001. Generic schema matching with cupid. In: Proceedings of the 27th International Conferences on Very Large Databases, pp. 49–58.
Maier, T., 2004. A formal model of the ETL process for OLAP-based web usage analysis. In: Proceedings of the Sixth WEBKDD Workshop: Webmining and Web Usage Analysis (WEBKDD’04), in conjunction with the 10th ACM SIGKDD Conference (KDD’04), Seattle, Washington, USA, August 22, 2004 (accessed 2006).
Miller, R.J., Haas, L.M., Hernandez, M.A., 2000. Schema mapping as query discovery. In: Proceedings of the 26th VLDB Conference, Cairo.
Moss, L.T., 2005. Moving Your ETL Process into Primetime. <http://www.businessintelligence.com//ex/asp/code.44/xe/article.htm> (visited June 2005).
Mrunalini, M., Kumar, T.V.S., Kanth, K.R., 2009. Simulating secure data extraction in extraction transformation loading (ETL) processes. In: IEEE Computer Modeling and Simulation Conference. EMS’09. Third UKSim European Symposium, November 2009, pp. 142–147. ISBN: 978-1-4244-5345-0.
Muñoz, Lilia, Mazón, Jose-Norberto, Trujillo, Juan, 2009. Measures for ETL processes models in data warehouses. In: ACM Proceeding of the First International Workshop on Model Driven Service Engineering and Data Quality and Security, November 2009.
Muñoz, Lilia, Mazón, Jose-Norberto, Trujillo, Juan, 2010a. Systematic review and comparison of modeling ETL processes in data warehouse. In: Proceedings of the Fifth Iberian Conference on IEEE Information Systems and Technologies (CISTI), August 2010, pp. 1–6. ISBN: 978-1-4244-7227-7.
Muñoz, Lilia, Mazón, Jose-Norberto, Trujillo, Juan, 2010b. A family of experiments to validate measures for UML activity diagrams of ETL processes in data warehouses. Information and Software Technology 52 (11), 1188–1203.
Naqvi, S., Tsur, S., 1989. A Logical Language for Data and Knowledge Bases. Computer Science Press.
Oracle Corp., 2001. Oracle9i Warehouse Builder User’s Guide, Release 9.0.2, November 2001. <http://www.otn.oracle.com/products/warehouse/content.html>.
Rifaieh, R., Benharkat, N.A., 2002. Query-based data warehousing tool. In: Proceedings of the Fifth ACM International Workshop on Data Warehousing and OLAP, November 2002.
Shilakes, C., Tylman, J., 1998. Enterprise Information Portals. Enterprise Software Team. <http://www.sagemaker.com/company/downloads/eip/indepth.pdf>.
Simitsis, A., 2003. Modeling and Managing ETL Processes. VLDB Ph.D. Workshop.
Simitsis, Alkis, Vassiliadis, Panos, 2008. A method for the mapping of conceptual designs to logical blueprints for ETL processes. Decision Support Systems, Data Warehousing and OLAP 45 (1), 22–40.
Simitsis, Alkis, Skoutas, Dimitrios, Castellanos, Malú, 2008. Natural language reporting for ETL processes. In: Proceeding of the ACM 11th International Workshop on Data Warehousing and OLAP, pp. 65–72. ISBN: 978-1-60558-250-4.
Staudt, M., Vaduva, A., Vetterli, T., 1999. Metadata Management and Data Warehousing. Technical Report, The Department of Information Technology (IFI) at the University of Zurich.
Stonebraker, M., Hellerstein, J., 2001. Content integration for e-business. In: Proceedings of the ACM SIGMOD/PODS 2001, Santa Barbara, CA, May 21–24, 2001.
Trujillo, J., Lujan-Mora, S., 2003. A UML based approach for modeling ETL processes in data warehouses. In: Proceedings of the 22nd International Conference on Conceptual Modeling. LNCS, Chicago, USA.
Vassiliadis, P., 2000. Data Warehouse Modeling and Quality Issues. Ph.D. Thesis, Department of Electrical and Computer Engineering, National Technical University of Athens (Greece).
Vassiliadis, P., Simitsis, A., Skiadopoulos, S., 2002. Conceptual modeling for ETL processes. In: Proceedings of the Fifth ACM International Workshop on Data Warehousing and OLAP, pp. 14–21.
Vassiliadis, P., Simitsis, A., Skiadopoulos, S., 2002. Modeling ETL activities as graphs. In: Proceedings of the Fourth International Workshop on the Design and Management of Data Warehouses (DMDW’02), Toronto, Canada, pp. 52–61.
Vassiliadis, P., Simitsis, A., Georgantas, P., Terrovitis, M., 2003. A framework for the design of ETL scenarios. In: Proceedings of the 15th CAiSE, Velden, Austria, June 16, 2003.
Vassiliadis, P., Simitsis, A., Georgantas, P., Terrovitis, M., Skiadopoulos, S., 2005. A generic and customizable framework for the design of ETL scenarios. Information Systems Journal.
Zhang, Xufeng, Sun, Weiwei, Wang, Wei, Feng, Yahui, Shi, Baile, 2008. Generating incremental ETL processes automatically. In: IEEE Computer and Computational Sciences, pp. 516–521.
