
An ETL Services Framework Based on Metadata

Huamin Wang
International School of Software, Wuhan University
Wuhan, P.R. China, 430079
H-wealth@163.com

Zhiwei Ye
School of Computer Science, Hubei University of Technology
Wuhan, P.R. China, 430068
weizhiye121@163.com

Funded by the Key Laboratory of Geo-informatics of State Bureau of Surveying and Mapping (200706).

Abstract—This paper first analyzes the problems of existing ETL tools and proposes a metadata-based ETL service model, then summarizes the types of metadata involved and their scopes of application. Based on this ETL service model, a concrete ETL services framework is put forward, and its key services are discussed, including the metadata management service, metadata definition service, ETL transformation rules service, process definition service, SQL code generation and optimization service, and process control service. Finally, the definition method and related algorithms for ETL rules are designed and analyzed. Practice has shown that the model and framework proposed in this paper can improve ETL efficiency to a large extent.

Keywords-ETL Services Framework; Metadata; ETL Rules

I. Introduction

ETL (Extract-Transform-Load) is the process of extracting data from a variety of heterogeneous data sources, transforming the extracted data into the required format, and then loading the data into the DW (Data Warehouse) [1]. ETL is not only the cornerstone and soul of building a data warehouse but also a necessary step in establishing the DW, so it plays an important role in data warehouse construction. Under normal circumstances, the workload of developing ETL accounts for 60% to 80% of that of developing the entire data warehouse system [DEMA97].

Besides hand-coding the ETL process, users can also implement it with existing ETL tools such as IBM Visual Warehousing, Microsoft DTS, and Oracle Warehouse Builder. However, these tools are difficult to manipulate, and mastering their rules and languages is very time-consuming. ETL designers are required to be familiar with the data structures, ETL rules, and operational processes; they need to understand not only the overall ETL process but also the detailed definition of each concrete step. It is therefore very difficult to improve the efficiency of ETL process development. What is more, designers must redesign the ETL process whenever business rules change or the data source or data destination is altered. It can thus be seen that, for most ETL tools, it is difficult to design processes rapidly and to reuse rules.

In fact, there are a large number of studies on how to implement ETL. One opinion holds that workflow should be used to describe the ETL process [2], because workflow can separate the ETL process from view maintenance. This approach has the advantage of describing the ETL process well, but its shortcomings are that users cannot take advantage of the existing research results on view maintenance in the relational model, and that graphical working modes are not suitable for ETL processes with a large number of rules. For instance, a very complex mapping graph easily arises when establishing mapping rules between source data and destination data that contain many fields. Another representative viewpoint, raised by Christof Bornhövd, is to use a metadata-driven integration model to integrate Internet resources [3]; its aim is to fully utilize the data warehouse metadata for data integration, but its drawback is that it is difficult to implement ETL processes and the interoperability of rule-based business.

This paper combines the advantages of the above two kinds of ideas and puts forward an ETL service model based on metadata. That is, in order to implement the entire ETL process, designers only need to analyze the ETL process and describe the involved ETL metadata; SQL scripts are then generated automatically from these metadata, and eventually the whole ETL process can be released in a unified service format. The main advantage of this idea is its ability to make the most of the mature research results of the relational model and to maximize the reuse of ETL processes, so that the ETL process becomes more flexible and achieves higher performance. The focus of this study is therefore the versatility and efficiency of the ETL engine, which can be extended effectively and can control the ETL process flexibly by using metadata. In order to take full advantage of the mature relational model, this study mainly focuses on the relational model.

II. Design of ETL Service Model

Metadata is data that describes other data. In different environments, metadata represents different types of data. The metadata in this study includes not only data warehouse metadata but also the metadata of data sources, transformation rules, extraction rules, and workflow rules. The different kinds of metadata in the ETL service model cover different scopes.




1) Data warehouse metadata. In the ETL model, this metadata is mainly used to describe the relevant information of the ETL process of the DW, especially the description information of the data sources and data destinations. In the data warehouse, the format of this description information is consistent, because the relational schema is the foundation of the DW and everything is expressed as relations, such as the data warehouse model, dimensions, data granularity, relational tables, attribute sets, integrity constraints, views, stored procedures, and data integration rules.

2) Data source metadata. Data source metadata mainly describes the data source information to be extracted, including the data format, IP address, access ports, database schema, relational table structures, attribute sets, indexes, views, stored procedures, and referential integrity constraints, as well as statistical information such as the average, maximum, and minimum values of a field and the total number of records.

3) Extraction rules metadata. Extraction rules metadata mainly describes the mapping relationships between the data sources and the DW, including the source field information, destination field information, and transformation rule information. The relationship between source fields and a destination field can be many-to-one, and different situations can use different extraction rules (a minimal sketch of such a mapping is given after this list).

4) Transformation rules metadata. The transformation rules involved in this study are implemented by stored procedures, triggers, and user-defined functions of the DW, and the definitions and call methods of these transformation rules are preserved in metadata, such as the stored procedure name, function description, parameter types, and return value types.

5) Process rules metadata. Process rules metadata primarily describes the detailed information of the data extraction processes, such as the execution sequence and the exceptions and their handling measures. The execution sequence is allowed to include other existing ETL processes of different granularity, and these are described in the same way as ordinary nodes, which makes all ETL processes reusable.
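The paper stores these metadata as XML documents and does not fix an implementation language. Purely as an illustration, the many-to-one field mapping described in item 3 might be modeled in Java as follows; every class and field name here is hypothetical, not taken from the paper:

    import java.util.List;

    // Hypothetical sketch of extraction-rule metadata: a many-to-one mapping
    // from several source fields to one destination field, together with the
    // name of the transformation rule that combines them.
    public class ExtractionRule {
        private final List<String> sourceFields;   // e.g. ["first_name", "last_name"]
        private final String destinationField;     // e.g. "full_name"
        private final String transformationRule;   // e.g. a stored-procedure or function name

        public ExtractionRule(List<String> sourceFields,
                              String destinationField,
                              String transformationRule) {
            this.sourceFields = sourceFields;
            this.destinationField = destinationField;
            this.transformationRule = transformationRule;
        }

        public List<String> getSourceFields()   { return sourceFields; }
        public String getDestinationField()     { return destinationField; }
        public String getTransformationRule()   { return transformationRule; }
    }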
In this ETL service model, metadata plays an important role and is the core and foundation of the entire ETL process design and control. All data are managed by metadata; the transformation rules use these metadata to form an execution sequence and then generate the destination data of the DW. Changes to the ETL process, such as adding an ETL process, adding a task, or deleting a task, are handled by modifying the metadata. According to the characteristics of ETL processes, their sequences can be divided into a fixed part and a variable part: the fixed part can be implemented directly by transformation rules, while the variable part is managed by metadata. The ETL service model is shown in Fig. 1.

In the ETL service model, each ETL service can be seen as a logical target that completes a specific ETL task. These logical targets are usually abstracted into a directed graph G = (V, E) composed of a set of logic nodes (V) and logic edges (E). Each logic node is of one of two types: an independent logic node or a combinational logic node. An independent logic node refers to an ETL task with high cohesion that should not be divided further. Combinational logic nodes are combinations of ETL process services of different granularity or of independent logic nodes, and they are called in the same way as independent logic nodes. This is the theoretical basis of ETL process sharing, which makes it possible to avoid the shortcomings of traditional coding methods. A logic edge, in contrast to the logic nodes, is only used to describe the logical relationship between nodes. If a logic target A is very similar to a logic target B, B does not need to be modified, or needs only minor modifications. The metadata can actually be seen as an abstraction of ETL processes or of various data types, and is always stored statically. For example, an XML document for controlling the process, which is executed by interpretation, can be used to store the directed graph and finally implement the ETL logic target.
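The paper gives no code for this graph of logic targets. The following Java sketch, with hypothetical class names (Java is suggested by the Tomcat-based prototype but never stated), shows one way the independent and combinational nodes described above could be organized and executed along the logic edges:

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch of the directed graph of ETL logic targets.
    // An independent node wraps one cohesive ETL task (e.g. one SQL script);
    // a combinational node groups other nodes and is called the same way.
    interface LogicNode {
        void execute();
    }

    class IndependentNode implements LogicNode {
        private final String sqlScript;               // task body, e.g. generated SQL
        IndependentNode(String sqlScript) { this.sqlScript = sqlScript; }
        public void execute() {
            // in the real engine this would be handed to the database access service
            System.out.println("executing: " + sqlScript);
        }
    }

    class CombinationalNode implements LogicNode {
        private final List<LogicNode> children = new ArrayList<>();  // ordered by the logic edges
        void add(LogicNode child) { children.add(child); }
        public void execute() {
            for (LogicNode child : children) {
                child.execute();                       // same call method as an independent node
            }
        }
    }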
Figure 1. ETL Services Model Based on Metadata

III. ETL Services Framework

According to the above model analysis, this paper takes metadata as the core, and all functions are organized as services, so that all ETL services are released in a unified format and can be accessed in a unified way. The ETL services framework is shown in Fig. 2. The framework's core services include the metadata management service, metadata definition service, ETL transformation rules service, process definition service, SQL code generation and optimization service, process control service, logging service, change management service, and exception handling service.

1) Metadata management services. The main functions of the metadata management services are to parse XML documents and to manage the metadata object pool. Metadata is stored in XML files; some metadata are universal, such as the metadata for different relational schemas, while other metadata are associated with particular rules. In order to parse the XML files efficiently, metadata objects are loaded according to their metadata types when the system is initialized (a minimal pool sketch is given below).

2) Metadata definition services. Their role is to provide data definition functions for the processes and data involved in the ETL process; the XML documents containing the definition information are saved into the metadata database. The metadata database can then provide the corresponding metadata interface services, including the methods for accessing a metadata service and its parameter information.
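The metadata object pool of item 1 is described only in prose. A minimal Java sketch, assuming a simple XML layout and hypothetical type names, might look like this:

    import java.io.File;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    // Hypothetical metadata object pool: metadata elements are read from an
    // XML file once at system initialization and grouped by their "type"
    // attribute (e.g. "datasource", "extraction-rule", "process-rule").
    public class MetadataPool {
        private final Map<String, List<String>> pool = new HashMap<>();

        public void load(File metadataXml) throws Exception {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(metadataXml);
            NodeList items = doc.getElementsByTagName("metadata");
            for (int i = 0; i < items.getLength(); i++) {
                Element e = (Element) items.item(i);
                pool.computeIfAbsent(e.getAttribute("type"), k -> new ArrayList<>())
                    .add(e.getTextContent());
            }
        }

        public List<String> byType(String type) {
            return pool.getOrDefault(type, List.of());
        }
    }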

Figure 2. ETL Framework


3) ETL transformation rules services. Their role is to provide data transformation services for the system according to the rules set by users. These services mainly provide the corresponding transformation rules to the metadata database, the process definition service, and the process control service.

4) Process definition services. The process definition services divide each ETL process into a series of ETL processes or ETL services of different granularity; in fact, an ETL process is ultimately defined as a series of SQL statements. This serialization into SQL statements is very easy to extend and to customize flexibly. Two processes need to be created for each ETL process: a total ETL process and an incremental ETL process. The total ETL process is executed only when data is loaded for the first time, while the incremental ETL process is used in the repeatedly updated environment. Because these services would otherwise have to query the relevant metadata and other information every time they are executed, object serialization and deserialization technology is used: the first time the XML files describing an ETL process are compiled, the result is serialized into a binary file on the local hard disk. In this way, the parsing efficiency of the ETL process is greatly improved (a minimal serialization sketch is given below).
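The paper does not show how the compiled process definitions are cached. Assuming standard Java serialization, with hypothetical class and file names, the idea could be sketched as:

    import java.io.*;
    import java.util.List;

    // Hypothetical cache of a compiled ETL process definition: the parsed XML
    // is turned into a serializable object once, written to a binary file, and
    // simply deserialized on later runs instead of re-parsing the XML.
    public class CompiledProcess implements Serializable {
        private static final long serialVersionUID = 1L;
        public final List<String> sqlStatements;   // ordered SQL of this process (use a serializable list)
        public CompiledProcess(List<String> sqlStatements) { this.sqlStatements = sqlStatements; }

        public void saveTo(File cache) throws IOException {
            try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(cache))) {
                out.writeObject(this);
            }
        }

        public static CompiledProcess loadFrom(File cache) throws IOException, ClassNotFoundException {
            try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(cache))) {
                return (CompiledProcess) in.readObject();
            }
        }
    }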
5) SQL code generation and optimization services. This service integrates a number of flexible SQL generators that provide the same interface and are independent of one another. These generators control the first loading and the incremental updating of the fact tables, dimension tables, dimension mapping tables, and log tables. After a SQL statement is generated, it still needs to be further optimized according to the information in the metadata database [4]. The SQL statements can be optimized not only manually but also through customized optimization classes, so the ETL engine has highly dynamic expansion capabilities, and as the application deepens, the optimization mechanism becomes even better (a rough interface sketch is given after this list of services).

6) Process control services. These services are mainly responsible for process control and ensure that the sequence is executed according to its definition. At the same time, they include an error recovery and exception handling mechanism: when an error happens at any node, the error is recorded in the log table, the tasks performed by the current node are restored, and the error is then resolved.

7) Database access services. Using a common data access interface, the database access services connect directly to the database, execute the SQL statements according to the metadata information in the metadata database, and finally return the results to the client.

8) Log services. The log service is an effective way to ensure the quality of the ETL services and provides process monitoring and error recovery functions. The main logs include: an operational log for recording operational information; an error log for recording the error nodes; an analysis log for recording statistical information such as process load time, execution time, and execution frequency; and a change log for recording the change information of the services.

9) Change management services. Change management services provide an effective means for updating local or remote services. Once a service is updated, all correlative services are adjusted accordingly.
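The generator and optimizer roles of item 5 are described only in prose. Under the assumption of hypothetical interface names (the concrete SQL shown is illustrative only), the structure could look roughly like this:

    // Hypothetical sketch of item 5: independent SQL generators sharing one
    // interface, with a pluggable optimizer applied to the generated statement.
    interface SqlGenerator {
        String generate(boolean initialLoad);   // true: first (total) load, false: incremental update
    }

    interface SqlOptimizer {
        String optimize(String sql);
    }

    class FactTableGenerator implements SqlGenerator {
        public String generate(boolean initialLoad) {
            return initialLoad
                ? "INSERT INTO fact_sales SELECT * FROM stage_sales"
                : "INSERT INTO fact_sales SELECT * FROM stage_sales WHERE load_flag = 0";
        }
    }

    class GeneratorPipeline {
        String build(SqlGenerator generator, SqlOptimizer optimizer, boolean initialLoad) {
            return optimizer.optimize(generator.generate(initialLoad));
        }
    }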
IV. ETL Rules Services

When the ETL process is executed, ETL rules play an important role in controlling and monitoring the ETL process. Fig. 3 is a data flow chart of the rule definition module, and it mainly describes the source and destination data involved when ETL rules are defined on the client side.

Figure 3. Module of Rule Definition (data flow among the structure scanning of the source and target, the extract, clean, and load rule definitions, the metadata store, and ETL task creation)

Before defining the ETL rules, the client needs to scan the structures of the source and target databases and store them in the metadata database; the rule definition service then obtains the metadata from the metadata database. These metadata are displayed as graphic elements, and the client uses them to define ETL rules and save them into the metadata database. The detailed rule definition algorithm is shown in Fig. 4.

Figure 4. Algorithm of Rule Definition Model
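The algorithm itself is presented only as Fig. 4. A rough, self-contained Java sketch of the flow just described, with all names hypothetical and simple strings standing in for the metadata database and the rule editor, might be:

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch of the rule definition flow around Fig. 4:
    // scan the source and target structures, store them as metadata,
    // let the client define extract/clean/load rules on top of them,
    // then save the rules and create the ETL task.
    public class RuleDefinitionSketch {

        static List<String> scanStructure(String connectionInfo) {
            // stand-in for reading table/column definitions from a database
            return new ArrayList<>(List.of("table_a(id, name)", "table_b(id, value)"));
        }

        public static void main(String[] args) {
            List<String> sourceStructure = scanStructure("source-db");   // step 1: scan source
            List<String> targetStructure = scanStructure("target-db");   // step 1: scan target

            List<String> metadataStore = new ArrayList<>();              // stand-in for the metadata database
            metadataStore.addAll(sourceStructure);
            metadataStore.addAll(targetStructure);

            // step 2: the client maps source fields to target fields and
            // attaches extract, clean and load rules (here just one literal rule)
            String rule = "extract table_a.name -> clean(trim) -> load table_b.value";

            // step 3: save the rule as metadata and create the ETL task from it
            metadataStore.add(rule);
            System.out.println("ETL task created with rule: " + rule);
        }
    }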

V. Conclusion
Based on the ETL services framework proposed in this paper, an ETL prototype system has been developed and its performance has been evaluated. In this prototype system, Oracle 10g was used as the data warehouse and metadata database, and Tomcat 5.0.8 as the web server; all the ETL functions were released in a unified web service format, so that clients could access those services in a standard way. Practice has shown that the ETL service model proposed in this paper has strong dynamic expansion abilities, achieves the goal of flexibly controlling the ETL process, and has a good optimization mechanism. At the same time, this framework is able to take full advantage of the research results of the relational model and can effectively design and share ETL processes.

References

[1] Zhang X.F., Sun W.W., Wang W., et al. Generating Incremental ETL Processes Automatically. Computer and Computational Sciences, 2006: 516-521.
[2] M. Bouzeghoub, F. Fabret, M. Matulovic-Broque. Modeling Data Warehouse Refreshment Process as a Workflow Application. Proceedings of the International Workshop on Design and Management of Data Warehouses, 1999.
[3] Christof Bornhövd, Alejandro P. Buchmann. A Prototype for Metadata-based Integration of Internet Sources. Springer Berlin/Heidelberg, 1999: 439-445.
[4] Song Jie, Wang Da-Ling, Bao Yu-Bin, Yu Ge. Study on a Metadata-driven ETL Approach. Journal of Chinese Computer Systems, 2007, 2(12): 2167-2173 (in Chinese).
[5] Zhang Hui. The Research and Implement About ETL Tool Based on Workflow and Metadata. Hebei University of Technology, Tianjin, 2006 (in Chinese).
[6] Missier Paolo, Alper Pinar, Corcho Oscar, et al. Requirements and Services for Metadata Management. IEEE Internet Computing, 2007, 11(5): 17-25.
[7] Zhao Xiaofei, Huang Zhiqiu. A Formal Framework for Reasoning on Metadata Based on CWM. The 25th International Conference on Conceptual Modeling, 2006: 371-384.
[8] Zhang Xufeng, Sun Weiwei, Wang Wei, et al. Generating Incremental ETL Processes Automatically. Computer and Computational Sciences, 2006: 516-521.
[9] Aubrecht P., Kouba Z. Metadata Driven Data Transformation. The 5th World Multi-Conference on Systemics, Cybernetics and Informatics (SCI2001), 2001: 332-336.
[10] Bergamaschi, Sonia. An ETL Tool Based on Semantic Analysis of Schemata and Instances. Knowledge-Based and Intelligent Information and Engineering Systems, 13th International Conference, KES 2009.
[11] Mrunalini, M. Modeling of Secure Data Extraction in ETL Processes Using UML 2.0. Proceedings of the IASTED Asian Conference on Modeling and Simulation, 2007: 230-235.
