
The Clio Project: Managing Heterogeneity

Renée J. Miller 1    Mauricio A. Hernández 2    Laura M. Haas 2    Lingling Yan 2

C. T. Howard Ho 2    Ronald Fagin 2    Lucian Popa 2

1 Univ. of Toronto    2 IBM Almaden Research Center

miller@cs.toronto.edu    {mauricio,laura,lingling,ho,fagin,lucian}@almaden.ibm.com

Abstract

Clio is a system for managing and facilitating the complex tasks of heterogeneous data transformation and integration. In Clio, we have collected together a powerful set of data management techniques that have proven invaluable in tackling these difficult problems. In this paper, we present the underlying themes of our approach and present a brief case study.

1 Introduction

Since the advent of data management systems, the problems of data integration and transformation have been recognized as being ubiquitous and critically important. Despite their importance and the wealth of research on data integration, practical integration tools are either impoverished in their capabilities or highly specialized to a limited task or integration scenario. As a result, integration and transformation remain largely manual, time-consuming processes. However, a careful examination of both integration tools and research proposals reveals an interesting commonality in the basic data management techniques that have been brought to bear on these problems. This is a commonality that we believe has not been sufficiently exploited in developing a general purpose integration management tool. In the Clio project, a collaboration between IBM Almaden Research Center and the University of Toronto, we have built a tool that automates the common, even routine, data and structure management tasks underlying a wealth of data integration, translation, transformation and evolution tasks.

We begin in Section 2 with a brief overview of some of the problems we are addressing. In Section 3, we present an overview of the Clio architecture and discuss how we support common requirements for managing heterogeneous data. We present a brief overview of how Clio can be used in a case study in Section 4 and conclude in Section 5 with a brief description of the current direction of our work.

2 Data Integration, Transformation, and Evolution

Many modern data applications in data warehousing and electronic commerce require merging, coalescing and transforming data from multiple diverse sources into a new structure or schema. Many of these applications start with an understanding of how data will be used and viewed in its new form. For instance, in a data exchange scenario, the exchange format may have been standardized (perhaps in the form of a standard XML schema or DTD). Other applications may involve an integration step in which the transformed or integrated structure is created. The activities involved in many of these integration or transformation scenarios can be grouped into three broad categories.

Schema and Data Management At the core of all integration tasks lies the representation, understanding and manipulation of schemas and the data that they describe and structure. The specifics of different integration proposals can vary dramatically. However, all approaches require reasoning about schemas, data and constraints. Often legacy schemas are underspecified or have not been maintained to accurately and completely model the semantics of a perhaps evolving data set. Since integration methodologies depend on the accuracy and completeness of structural and semantic information, they are best employed in an environment where specified (and unspecified) schema information, constraints and relationships can be learned, reasoned about and verified.



Correspondences Management A second step common to all integration tasks is the understanding of how different, perhaps independently developed, schemas (and data) are related. To support this activity, a set of correspondences or "matches" between schemas must be determined. In the schema integration literature, this process is referred to as determining "inter-schema" relationships [RR99]. In model management, it is referred to as model matching [BHP00]. For example, in the schemas of Figure 2, different terminology is used within the schemas (and perhaps within the data). Before we can integrate the two schemas (or before we can translate data from one representation to the other), we must have some understanding of how the schemas correspond.

Integration tasks that involve matching often cannot be fully automated since the syntactic representation of schemas, metadata and data may not completely convey the semantics of different data sets. As a result, we must rely on an outside source (either a user or a knowledge discovery technique) to provide some information about how different schemas (and data) correspond. A host of techniques has been developed for (partially) automating the matching task, both for the specific problem of schema integration [RR99] and for the broader model management task [BHP00]. While differing in the knowledge used and the reasoning or knowledge discovery techniques employed, at their core all these techniques learn or propose associations or correspondences between components of different schemas. For example, in Figure 2, there may be a correspondence between Calls.Caller and References.Artifact (perhaps Calls contains information about program functions which call each other, and functions are considered to be program artifacts in the warehouse schema). Similarly, there may be a separate correspondence between the file in which a function is defined (Function.File) and the source of a program artifact (References.Source). Such correspondences may be entered by a user, perhaps using a graphical interface such as the one supported in Clio [MHH00], or learned using a machine learning technique applied to the data or schema names. These correspondences may in turn be given different interpretations. Perhaps a correspondence means that one attribute is a subset of the other. Or perhaps it means that the two attributes are semantically related.

However, regardless of interpretation, there are some characteristics that all matching approaches share. First, no approach, whether manual or automated, is always complete or always accurate for all possible schemas. As a result, it is important to permit verification of the correspondences, either manually or using a knowledge discovery technique. Second, the sheer number of correspondences can be enormous. No successful matching technique has employed solely relation or class level correspondences. Rather, matching requires a fine grain specification of correspondences at the attribute (or even data) level. As a result, any technique that uses manual specification or verification of correspondences must necessarily be incremental in nature, to permit a user to work with large schemas and large sets of correspondences without being overwhelmed. An incremental approach also facilitates the correction and refinement of correspondences.

Mapping Management The third common step is that of creating an operational mapping between schemas. Such a mapping is a program or set of queries that can be used to translate data between the schemas. Creating and maintaining such mappings is today a largely manual (and extremely complex) process. In some integration scenarios, the mapping (perhaps a view definition) may be an artifact of the transformation used. However, the mappings produced by a series of transformations and merging steps must be integrated and composed. In many other applications, the integrated schema is created independently of the source schemas. Hence, mapping must be done independently [MHH00]. For instance, before a data warehouse can be loaded, DBAs and consultants spend months determining what types of queries will be asked, and then designing a schema that will readily support those queries. To load the warehouse, they then must create mappings between the warehouse schema and the underlying data sources' schemas. To deploy a global information system, experts first determine what information it will present to the world, that is, what logical structure (the transformation process), and then create the view definitions (the mapping creation process) that map between the new schema and the data sources. The focus in these schema mapping applications is on the discovery of a set of queries that realizes the mapping [MHH00].
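A concrete mapping query for the software engineering schemas of Figure 2 is developed in Section 4. Purely as a hypothetical sketch of the warehouse-loading scenario described above (the relation and attribute names below are invented and do not come from this paper), such a mapping might be expressed as a view over the operational sources:

    -- Hypothetical mapping: populate a warehouse summary table from two
    -- invented source tables; the view itself is the operational mapping.
    CREATE VIEW WarehouseSales (Region, Year, Total) AS
    SELECT C.Region, O.OrderYear, SUM(O.Amount)
    FROM   Orders O, Customers C
    WHERE  O.CustId = C.CustId
    GROUP  BY C.Region, O.OrderYear;

Once such a view is defined, loading or querying the warehouse reduces to evaluating it over the sources, which is the sense in which a mapping is a set of queries rather than a description of intent.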


Mapping management, like schema and correspondence management, requires reasoning about matches and schemas. The correspondence process may not be sufficient to fully convey the semantics of how schemas are related. In the example mentioned above, if Calls.Caller and References.Artifact have been matched, and if Function.File and References.Source have been matched (Figure 2), these correspondences alone are not sufficient to uniquely determine how data from the source will appear in the target. In particular, details about how source entities are paired or joined are left unspecified, as are details of which source entities should be included. Resolving such ambiguities requires reasoning about schemas and constraints and may result in constraints or correspondences being modified or added.

3 An Overview of Clio

Clio is a system for managing and facilitating the complex tasks of heterogeneous data transformation and integration. Note that Clio does not perform schema integration per se. Rather, Clio supports the generation and management of schemas, correspondences between schemas, and mappings (queries) between schemas. The logical architecture of Clio is depicted in Figure 1. Each management and reasoning component makes use of a database management system for storing knowledge gained about schemas and integrations. Clio provides schema and data browsers to elicit and obtain feedback from users and to allow users to understand the results produced by each component.

Figure 1: Clio's Logical Architecture

Schema Engine A typical session with Clio starts with the user loading one or more schemas into the system. These schemas are read from either an underlying Object-Relational database, a legacy source that has been wrapped with a Garlic Object-Relational wrapper [TS97], or from an XML file with an associated XML schema. The schemas may be legacy schemas or they may include an integrated schema produced manually or by an integration tool.

The schema engine is used to augment the schema with additional constraint information, if necessary. Currently, Clio makes use of metadata, including query workloads (if available) and view definitions, along with data. For example, in the absence of declared constraints, we mine the data for possible keys and foreign keys. Finally, the schemas are verified by the user to ensure validity of generated information. For example, a discovered foreign key or inclusion dependency may hold on the current instance by accident; that is to say, it may not necessarily hold for all, or even most, instances. Clio permits such corrections to be made by a user.

To facilitate this process, Clio makes use of a graphical user interface to communicate information to the user [YMHF01]. In the Schema View mode, users see a representation of the schemas including any generated information. This view may be used to edit or further augment the schema. In addition, we provide a Data View mode, through which users may see some example data from the schemas to further help them understand the schemas. The data view can be invaluable in helping users understand opaque schema labels.
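The mining algorithm itself is not described in this paper; the following SQL is only a rough, hand-written sketch of the kind of instance-level check it implies, using the Function and Calls relations of Figure 2:

    -- Is Function.Name a candidate key on the current instance?
    -- (an empty result means no duplicate names were found)
    SELECT Name, COUNT(*) AS occurrences
    FROM   Function
    GROUP  BY Name
    HAVING COUNT(*) > 1;

    -- Does every Calls.Caller value appear in Function.Name?
    -- (a count of 0 suggests a candidate foreign key / inclusion dependency)
    SELECT COUNT(*) AS violations
    FROM   Calls C
    WHERE  NOT EXISTS (SELECT 1 FROM Function F WHERE F.Name = C.Caller);

Checks like these only show that a constraint is consistent with the data currently in the source, which is precisely why Clio asks the user to verify what it discovers.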
Correspondence Engine Given a pair of schemas, the correspondence engine generates and manages a set of candidate correspondences between the two schemas. Currently, we make use of an attribute classifier to learn possible correspondences [HT01]. Clio could (and may in the future) be augmented to make use of dictionaries, thesauri, and other matching techniques. The generated correspondences can be augmented, changed or rejected by a user using a graphical user interface through which users can draw value correspondences between attributes. Entering and manipulating value correspondences can be done in two modes. In the Schema View mode, users see a representation of the schemas and create value correspondences by selecting schema objects to be included in a correspondence. The alternative Data View mode offers a WYSIWYG interface that displays example data for the attributes used in the correspondences [YMHF01]. The data view helps a user check the validity of generated and user-entered information. Users may add and delete value correspondences and immediately see the changes reflected in the example data.



Figure 2: Schema of a source database (S) and a target Software Engineering Warehouse (G)

Mapping Engine The mapping engine supports the creation, evolution and maintenance of mappings between pairs of schemas. A mapping is a set of queries from a source schema to a target schema that will translate source data into the form of the target schema. Clio produces a mapping (or set of alternative mappings) that is consistent with the available correspondences and schema information. The mapping engine therefore uses information gathered by both the schema engine and the correspondence engine. As with the correspondence and schema constructs suggested by Clio, mappings are verified using the data view to help users understand alternative mappings. Users see example data from selected source tables and the contents of the target as they would appear under the current mapping. Examples are carefully chosen both to illustrate a given mapping (and the correspondences it uses) and to illustrate the perhaps subtle differences between alternative mappings [YMHF01]. For example, in some cases, changing a join from an inner join to an outer join may dramatically change the data produced by the mapping. In other cases, the same change may have no effect due to constraints that hold on the schemas.

To permit scalability and incremental invocation of the tool, we also permit (partial) mappings to be read and modified. Such mappings may be created by a former session with Clio or by another integration tool. For example, a user may have used Clio to map a source and target schema. At a later time, after the source schema has evolved, the user may again invoke Clio to create a mapping from the modified source to the target. The old mappings may be read in and used as a starting point for the mapping process. Modification is done using operations on data examples, in the data view [YMHF01].

The mapping creation process is inherently interactive and incremental. Clio stores the current mapping within its knowledge base and, through an incremental mapping discovery algorithm, allows users to extend and refine mappings one step at a time [MHH00]. For example, when value correspondences are added, deleted or modified within the correspondence engine, the mapping engine uses the new correspondences or modifications to update the mapping. Similarly, in order to verify a particular mapping, the mapping engine may invoke the schema engine to verify whether a specific constraint holds in a schema.
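The last observation can be made concrete with a small, hand-written check (not part of Clio) over the Function and Calls relations of Figure 2: an inner join and a left outer join of Function with Calls on Caller = Name produce exactly the same result whenever every function appears as a caller in Calls.

    -- Functions that never appear as a caller; if this returns no rows,
    -- switching between the inner and outer join variants of the mapping
    -- cannot change the data it produces on this instance.
    SELECT F.Name
    FROM   Function F
    WHERE  NOT EXISTS (SELECT 1 FROM Calls C WHERE C.Caller = F.Name);

If such an inclusion constraint is known to hold, the two alternative mappings collapse into one; if not, the data view can show the user exactly which functions would be lost under the inner join.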
4 A Data Warehouse Example

To illustrate our approach, we present an example based on a proposed software engineering warehouse for storing and exchanging information extracted from computer programs [BGH99]. Such warehouses have been proposed both to enable new program analysis applications, including data mining applications [MG99], and to promote data exchange between research groups using different tools and software artifacts for experimentation [HMPR97]. Figure 2 depicts a portion of a warehouse schema for this information. This schema has been designed to represent data about a diverse collection of software artifacts that have been extracted using different software analysis tools. The warehouse schema was designed to be flexible and uses a very generic representation of software data as labeled graphs. Conceptually, software artifacts (for example, functions, data types, macros, etc.) form the nodes of the graph. Associations or references between artifacts (for example, function calls or data references) form the edges. Two of the main tables, for artifacts and references, are depicted in the figure.

As new software analysis tools are developed, the data from these tools must be mapped into this integrated schema. In Figure 2, we also give a relational representation of an example source schema. This schema was imported using a wrapper built on top of output files produced by a program analysis tool. The wrapper produces a flat schema with no constraints. Clio's schema engine is used to suggest a set of keys and foreign keys that hold on the data. Foreign keys are depicted by dashed lines. Key attributes are underlined. The user may use Clio's Schema View to browse, edit or augment this schema information.



To start the correspondence process, if the warehouse is populated with data, our correspondence engine will apply an attribute matching algorithm to determine potential correspondences based on the characteristics of the values of different attributes. For example, if values in Calls.Caller in the source and References.Artifact are all UNIX file pathnames (that is, a sequence of mainly alphanumeric tokens separated by '/'s, with perhaps one token having a '.' extension), our mining algorithm would suggest that these attributes match [HT01]. If the warehouse is not populated with data, and the user cannot provide a few example values, then correspondences may be entered using our Schema View as suggested in Figure 3. It is unlikely that a mining technique based solely on schema labels will be effective for this example, since it is not obvious, even using ontology or dictionary based techniques, that Calls.Caller should be matched with References.Artifact. Suppose the user had only entered the first four correspondences (f1-f4) indicating how function call information corresponds to the warehouse schema. Using the discovered schema information together with these correspondences, Clio's mapping engine may produce the following two mappings. The first populates the Source attribute of the target with the File attribute of the caller function (Mapping S1). The second populates the Source attribute of the target with the File attribute of the called function (Mapping S2). Note that correspondence f4 maps the relation name into the ReferenceType value, effectively transforming schema to data [Mil98]. The left outer join is used to ensure information is not lost in the mapping. That is, Clio will prefer mappings that map every function (whether it participates in the Calls relation or not) to a target value.

    S1: SELECT C.Caller, C.Callee, relname(C), F.File
        FROM   Function F LEFT OUTER JOIN Calls C ON C.Caller = F.Name

    S2: SELECT C.Caller, C.Callee, relname(C), F.File
        FROM   Function F LEFT OUTER JOIN Calls C ON C.Callee = F.Name

To validate and choose among these mappings, Clio will illustrate each mapping using the data view. Example data will be selected and displayed for the user. The examples will illustrate the differences in the join paths used in each mapping selected by Clio. Hence, to illustrate S1, an example will be used of a function that calls at least one function but is itself not called (such a function will appear in the target associated with its callee using S1, but not using S2). Similarly, an example will be used of a function that is called by at least one function but does not call another function. Of course, such examples are used only if they are available in the data source. Once a join path is selected (perhaps S1 is selected), examples are also used to determine if the mapping should be a left outer join or if the user only wishes References to be populated with functions that appear in Calls (an inner join) [YMHF01].

Once the user is happy with this (partial) mapping, she may proceed incrementally by entering more value correspondences, by using operations on data examples to refine the current mapping [YMHF01], or by reinvoking the correspondence engine. Due to space limits we only illustrate this final option. Clio may make use of a (partial) mapping to deduce additional correspondences. Given the mapping S1 and the discovered constraints on S, Clio can infer that DataRef.Fct also may correspond to References.Artifact. Similarly, if the attribute matching algorithm was applied to the source attributes alone, a value correspondence between Function.File and DataType.File may have already been deduced within the source schema. This correspondence is not necessarily based on the two attributes containing the same values. Rather, it is an indication that the values share similar characteristics and therefore may (possibly) have a semantic relationship. Based on this information, Clio may propose f6 and f8 as potential correspondences. The mapping engine may then search for potential join paths to use in mapping data reference information to the warehouse.
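The paper's example stops here. Purely as a hedged sketch (not taken from the paper) of what one such join path might look like, a mapping analogous to S1 could populate References with data reference information, assuming a hypothetical DataRef.Type attribute that names the referenced data type:

    S3: SELECT D.Fct, D.Type, relname(D), F.File
        FROM   Function F LEFT OUTER JOIN DataRef D ON D.Fct = F.Name

As in S1, the left outer join keeps functions that reference no data, and relname(D) turns the relation name DataRef into a ReferenceType value in the warehouse.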



Figure 3: Value correspondences used to map between the source schema and the warehouse schema

5 Conclusion

We have discussed Clio, a system for managing data transformation and integration under development at IBM Almaden. Clio's Integration Engine is composed of three components (Schema, Correspondence, and Mapping Engine) that interact with our internal mapping knowledge base and with the user to produce the desired mapping. Our initial implementation of Clio includes most of the functionalities of the Correspondence Engine and the Mapping Engine described in this paper. Using the GUI's Schema View, users can draw correspondences among the selected source and target schemas and review the resulting mapping query (which is currently expressed as an SQL view). The initial implementation of the Schema Engine includes schema readers for relational and XML Schema sources. Augmentation of these schemas is possible through a knowledge discovery module that searches for keys and referential constraints using the underlying data.

We envision a number of extensions to the Mapping Engine. The mapping algorithm used in our prototype can only handle correspondences from flat relational source schemas into either relational or nested target schemas. We are working on generalizing this algorithm to handle correspondences from nested source schemas into nested target schemas. The next version of the system will be able to use source and target data constraints in combination with the input correspondences to validate the mapping (i.e., detect inconsistencies) and logically infer new mappings. If inconsistencies arise after the user enters a value correspondence, Clio should be able to explain the problem (e.g., violation of a key constraint) and suggest fixes (e.g., modifying the correspondence or relaxing a constraint). We are also looking into the ability to invert mappings (when possible), which would allow Clio to be used for bi-directional exchange of data.

Ultimately, we view Clio as an extensible management platform on which we can build a host of new integration and transformation techniques, including perhaps a robust query facility for schemas, correspondences and mappings (for example, to permit users to ask questions about mappings and their properties). An important theme in Clio, which we expect to continue, has been the use of data to help users understand the results produced by each reasoning component.

References

[BGH99] I. T. Bowman, M. W. Godfrey, and R. C. Holt. Connecting Software Architecture Recovery Frameworks. In Proceedings of the First International Symposium on Constructing Software Engineering Tools (CoSET'99), Los Angeles, May 17-18, 1999.

[BHP00] P. A. Bernstein, A. Y. Halevy, and R. A. Pottinger. A Vision for Management of Complex Models. SIGMOD Record, 29(4):55-63, 2000.

[HMN+99] L. M. Haas, R. J. Miller, B. Niswonger, M. Tork Roth, P. M. Schwarz, and E. L. Wimmers. Transforming Heterogeneous Data with Database Middleware: Beyond Integration. IEEE Data Engineering Bulletin, 22(1):31-36, 1999.

[HMPR97] M. J. Harrold, R. J. Miller, A. Porter, and G. Rothermel. A Collaborative Investigation of Program-Analysis-Based Testing and Maintenance. In International Workshop on Experimental Studies of Software Maintenance, pages 51-56, Bari, Italy, October 1997.

[HT01] H. C. T. Ho and X. Tian. Automatic Classification of Database Columns Using Feature Analysis. Submitted for publication, 2001.

[MG99] R. J. Miller and A. Gujarathi. Mining for Program Structure. International Journal on Software Engineering and Knowledge Engineering, 9(5):499-517, 1999.

[MHH00] R. J. Miller, L. M. Haas, and M. Hernández. Schema Mapping as Query Discovery. In Proc. of the Int'l Conf. on Very Large Data Bases (VLDB), pages 77-88, Cairo, Egypt, September 2000.

[Mil98] R. J. Miller. Using Schematically Heterogeneous Structures. ACM SIGMOD Int'l Conf. on the Management of Data, 27(2):189-200, June 1998.

[RR99] S. Ram and V. Ramesh. Schema Integration: Past, Current and Future. In A. Elmagarmid, M. Rusinkiewicz, and A. Sheth, editors, Management of Heterogeneous and Autonomous Database Systems, pages 119-155. Morgan Kaufmann Publishers, 1999.

[TS97] M. Tork Roth and P. Schwarz. Don't Scrap It, Wrap It! A Wrapper Architecture for Legacy Data Sources. In Proc. of the Int'l Conf. on Very Large Data Bases (VLDB), pages 266-275, Athens, Greece, August 1997.

[YMHF01] L. Yan, R. J. Miller, L. Haas, and R. Fagin. Data-Driven Understanding and Refinement of Schema Mappings. Submitted for publication, 2001.

