

DISSERTATION

Data Integration against
Multiple Evolving Autonomous Schemata

ausgeführt zum Zwecke der Erlangung des akademischen Grades
eines Doktors der technischen Wissenschaften unter der Leitung
von

o. Univ.-Prof. Dr. Robert Trappl
Institut für medizinische Kybernetik und Artificial Intelligence
Universität Wien

und

Universitätslektor Dipl.-Ing. Dr. Paolo Petta
Institut für medizinische Kybernetik und Artificial Intelligence
Universität Wien

eingereicht an der Technischen Universität Wien
Fakultät für Technische Naturwissenschaften und Informatik

von

Christoph Koch
E9425227
A-1030 Wien, Beatrixgasse 26/70

Wien, am

Abstract
Research in the area of data integration has resulted in approaches such as fed-
erated and multidatabases, mediation, data warehousing, global information sys-
tems, and the model management/schema matching approach. Architecturally,
approaches can be categorized into those that integrate against a single global
schema and those that do not, while on the level of inter-schema constraints,
most work can be classified either as so-called global-as-view or as local-as-view
integration. These approaches differ widely in their strengths and weaknesses.
Federated databases have been found applicable in environments in which
several autonomous information systems coexist – each with its own schema –
and need to share data. However, this approach does not provide
sufficient support for dealing with change of schemata and requirements. Other
approaches to data integration which are centered around a single “global” inte-
gration schema, on the other hand, cannot handle design autonomy of information
systems. Under evolution, this type of autonomy eventually leads to schemata
between which neither the global-as-view nor the local-as-view approaches to
source integration can be used to express the inter-schema semantics.
In this thesis, this issue is addressed with a novel approach to data integration
which combines techniques from model management, mediation, and local-as-
view integration. It allows for the design of inter-schema mappings that are more
robust when change occurs. The work has been motivated by the requirements
of large scientific collaborations in high-energy physics, as encountered by the
author during his stay at CERN.
The approach presented here is based on two foundations. The first is query
rewriting with very expressive symmetric inter-schema constraints, called con-
junctive inclusion dependencies (cind’s). These are containment relationships
between conjunctive queries. We address a very general form of the source inte-
gration problem, in which several schemata may coexist, each of them containing
a number of purely logical as well as a number of source entities. For the source
entities, the information system that belongs to the schema holds data, while the
logical entities exist so that schema entities of other information systems
can be integrated against them. The query rewriting problem now aims at rewriting a
query over (possibly) both source and logical schema entities of one schema into
a query over source entities only, which may be part of any of the known schemata. Under the
classical logical semantics, and given a conjunctive input query, we address the
problem of finding maximally contained positive rewritings under a set of cind’s.
Such rewritten queries can then be optimized and efficiently answered using clas-
sical distributed database techniques. For the purpose of data integration and
the sake of computability, we require the dependency graph of a set of cind’s to
be acyclic with respect to inclusion direction.
Regarding the query rewriting problem, we first present semantics and main
theoretical properties. Subsequently, algorithms and optimizations based on tech-
niques from database theory are presented, which have been implemented in a
research prototype. Finally, experimental results based on this prototype are
presented, which demonstrate the practical feasibility of our approach.
Reasoning is done exclusively over schemata and queries, and is independent
of data volumes, which renders it highly scalable. Apart from that, this flavor
of query rewriting has another important strength. The expressiveness of the
constraints allows for much freedom and flexibility for modeling the peculiarities
of a mapping problem. For instance, both global-as-view and local-as-view inte-
gration are special cases of the query rewriting problem addressed in this thesis.
As will be shown, this flexibility allows one to design mappings that are robust with
respect to change, as principles such as the decoupling of inter-schema dependen-
cies can be implemented. It is furthermore clear that query rewriting with cind’s
also permits dealing with concept mismatch in a very wide sense, as each pair of
corresponding concepts in two schemata can be modeled as conjunctive queries.
The second foundation is model management based on cind’s as inter-schema
constraints. Under the model management approach to data integration, sche-
mata and mappings are treated as first-class citizens in a repository, on which
model management operations can be applied. This thesis proposes definitions
of schemata and mappings, as well as an array of powerful operations, which are
well suited for designing and maintaining mappings between information systems
when change is an issue. To complete this work, we propose a methodology for
dealing with evolving schemata as well as changing integration requirements.
The combination of the contributions of this thesis brings a practical improve-
ment of openness and flexibility to the federated database and model management
approaches to data integration, and provides a first practical integration architecture for
large, complex, and evolving computing environments such as those encountered
in large scientific collaborations.

Inhaltsangabe
Research in the area of data integration has produced directions such as federated and
multidatabases, mediation, data warehousing, global information systems, and model
management or schema matching. From an architectural point of view, one can distinguish
between approaches in which integration is carried out against a single global schema and
those in which this is not the case. On the level of inter-schema semantics, most of the
work to date can be classified into the so-called global-as-view and local-as-view
approaches. These approaches differ, in part considerably, in their individual properties.

Federated databases have proven useful in environments in which several information systems
have to exchange data with one another, but each of these information systems has its own
schema and is also autonomous as far as the design of this schema is concerned. In practice,
however, this approach unfortunately does not support the maintenance of changing schemata.
Other well-known approaches, which integrate against a “global” schema, in turn do not
support the design autonomy of information systems. When schema changes become necessary,
this kind of autonomy often leads to schemata against which the desired inter-schema
semantics can be expressed neither by global-as-view nor by local-as-view approaches.

This problem is the topic of this dissertation, in which a new approach to data integration
is proposed that combines ideas from model management, mediation, and local-as-view
integration. Our approach enables the modeling of (partial) mappings between schemata that
exhibit an advantageous robustness against change. The motivation for the presented results
is a consequence of an extended stay of the author at CERN, during which the goals and needs
of large scientific collaborations concerning their information infrastructure were studied.

Our approach rests on two central foundations. The first is query rewriting, i.e. the
rewriting of queries, under very expressive “symmetric” inter-schema dependencies, namely
inclusion dependencies between so-called conjunctive queries, which we call conjunctive
inclusion dependencies (cind’s). We treat a very general form of the source integration
problem, in which several schemata may coexist and each of them may contain both genuine
database entities, for which data are available, and purely logical or “virtual” entities,
against which dependencies on other schemata can be defined with the help of cind’s. The
query rewriting problem then aims at rewriting a query, which may be posed over both logical
and genuine entities of one schema, into another query that uses only genuine database
entities – if necessary, from all schemata known to the integration system. More precisely,
under the classical logical semantics and with the help of a set of cind’s, a conjunctive
query is rewritten into a maximally contained positive query. Queries rewritten in this way
can be answered using well-known techniques from the field of distributed databases. For
theoretical reasons explained in more detail in this dissertation, we restrict ourselves –
for data integration – to sets of cind’s whose dependency graph is acyclic with respect to
the inclusion direction of the cind’s.

Regarding the query rewriting problem, we first present semantics and theoretical
properties. Then algorithms and optimizations building on database techniques are presented,
which have been implemented in a prototype. Matching benchmarks for this prototype are also
provided, which are intended to show that our approach is efficient enough to be of
practical relevance.

Our approach scales excellently to large data volumes, since the data integration problem
is solved exclusively on the level of schemata and queries, not on the level of data. A
further strength is the high expressiveness of our dependencies (cind’s), which allows much
flexibility in modeling inter-schema relationships; for example, both local-as-view and
global-as-view integration are special cases of our approach. As is also shown, this
flexibility allows mappings to be created that are robust against change, since it makes it
possible to keep cind’s largely independent of one another, so that necessary changes
usually remain locally confined. Query rewriting with cind’s clearly also makes it possible
to deal with a very large class of disparities between concepts, since pairs of
corresponding concepts (to be exact, concepts one of which contains the other) are expressed
by two conjunctive queries placed in relation to each other.

The second foundation is model management with cind’s. In the model management approach,
schemata and mappings are managed as objects with identity, to which a number of powerful
maintenance and manipulation operations can be applied. In this dissertation, operations
are defined that are suitable for managing mappings in such a way that frequent changes
remain manageable. In addition, a methodology for the management of schema evolution is
presented.

The combination of the technical contributions of this dissertation enables a marked
improvement of openness and flexibility for the model management and federated database
approaches to data integration, and constitutes the first practical solution to the data
integration problems encountered in the context of complex, autonomous, and changing
information landscapes such as large scientific collaborations.

Acknowledgments
Most of the work on this thesis was carried out during a 30-month stay at CERN,
which was sponsored by the Austrian Federal Ministry of Education, Science and
Culture under the CERN Austrian Doctoral Student Program.
I would like to thank the two supervisors of my thesis, Robert Trappl of the
Department of Medical Cybernetics and Artificial Intelligence of the University
of Vienna and Jean-Marie Le Goff of CERN / ETT Division and the University
of the West of England for their continuous support. This thesis would not have
been possible without their help.
Paolo Petta of the Austrian Research Institute for Artificial Intelligence took
over much of the day-to-day supervision, and I am indebted to him for countless
hours of discussions, proofreading of draft papers, and feedback of any kind.
I would like to thank Enrico Franconi of the University of Manchester for
provoking my interest in local-as-view integration during his short visit at CERN
in early 2000, which has influenced this thesis. I am also indebted to Richard Mc-
Clatchey and Norbert Toth of the University of the West of England and CERN
for valuable comments on parts of an earlier version of this thesis. However,
mistakes, as is obvious, are entirely mine.
Contents

1 Introduction 13
1.1 A Brief History of Data Integration . . . . . . . . . . . . . . . . . 13
1.2 The Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.3 Use Case: Large Scientific Collaborations . . . . . . . . . . . . . . 18
1.4 Contributions of this Thesis . . . . . . . . . . . . . . . . . . . . . 23
1.5 Relevance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.6 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2 Preliminaries 27
2.1 Query Languages . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2 Query Containment . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.3 Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.4 Global Query Optimization . . . . . . . . . . . . . . . . . . . . . 34
2.5 Complex Values and Object Identities . . . . . . . . . . . . . . . . 35

3 Data Integration 39
3.1 Definitions and Overview . . . . . . . . . . . . . . . . . . . . . . . 39
3.2 Federated and Multidatabases . . . . . . . . . . . . . . . . . . . . 41
3.3 Data Warehousing . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4 Information Integration in AI . . . . . . . . . . . . . . . . . . . . 44
3.4.1 Integration against Ontologies . . . . . . . . . . . . . . . . 44
3.4.2 Capability Descriptions and Planning . . . . . . . . . . . . 45
3.4.3 Multi-agent Systems . . . . . . . . . . . . . . . . . . . . . 47
3.5 Global-as-view Integration . . . . . . . . . . . . . . . . . . . . . . 50
3.5.1 Mediation . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.5.2 Integration by Database Views . . . . . . . . . . . . . . . 51
3.5.3 Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.6 Local-as-view Integration . . . . . . . . . . . . . . . . . . . . . . . 53
3.6.1 Answering Queries using Views . . . . . . . . . . . . . . . 54
3.6.2 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.6.3 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . 60
3.7 Description Logics-based Information Integration . . . . . . . . . 62
3.7.1 Description Logics . . . . . . . . . . . . . . . . . . . . . . 62


3.7.2 Description Logics as a Database Paradigm . . . . . . . . 63
3.7.3 Hybrid Reasoning Systems . . . . . . . . . . . . . . . . . . 65
3.8 The Model Management Approach . . . . . . . . . . . . . . . . . 65
3.9 Discussion of Approaches . . . . . . . . . . . . . . . . . . . . . . . 66

4 Reference Architecture 71
4.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.2 Mediating a Query . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.3 Research Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

5 Query Rewriting 75
5.1 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.3 Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.3.1 The Classical Semantics . . . . . . . . . . . . . . . . . . . 78
5.3.2 The Rewrite Systems Semantics . . . . . . . . . . . . . . . 82
5.3.3 Equivalence of the two Semantics . . . . . . . . . . . . . . 84
5.3.4 Computability . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.3.5 Complexity of the Acyclic Case . . . . . . . . . . . . . . . 90
5.4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.5.1 Chain Queries . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.5.2 Random Queries . . . . . . . . . . . . . . . . . . . . . . . 97
5.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

6 Model Management 99
6.1 Model Management Repositories . . . . . . . . . . . . . . . . . . . 99
6.2 Managing Change . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.2.1 Decoupling Mappings . . . . . . . . . . . . . . . . . . . . . 103
6.2.2 Merging Schemata . . . . . . . . . . . . . . . . . . . . . . 107
6.3 Managing the Acyclicity of Constraints . . . . . . . . . . . . . . . 108

7 Outlook 111
7.1 Physical Data Independence . . . . . . . . . . . . . . . . . . . . . 113
7.1.1 The Classical Problem . . . . . . . . . . . . . . . . . . . . 113
7.1.2 Versions of Logical Schemata . . . . . . . . . . . . . . . . 117
7.2 Rewriting Recursive Queries . . . . . . . . . . . . . . . . . . . . . 122

8 Conclusions 127
List of Figures

1.1 Mappings in LAV (left) and GAV (right). . . . . . . . . . . . . . . 15
1.2 The space of objects that can be shared using symmetric map-
pings given true concept mismatch between entities of source and
integration schemata. . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.3 Data flow between information systems that manage the steps of
an experiment’s lifecycle. . . . . . . . . . . . . . . . . . . . . . . . 20
1.4 ER diagrams for Example 1.3.1: Electronics database (left) and
product-data management system (right). . . . . . . . . . . . . . 21
1.5 Concept mismatch between PCs of the electronics database and
parts of the product-data management system of “Project1”. . . . 22
1.6 Architecture of the information infrastructure . . . . . . . . . . . 24

3.1 Artist’s impression of source integration. . . . . . . . . . . . . . . 40
3.2 Federated 5-layer schema architecture . . . . . . . . . . . . . . . . 42
3.3 Data warehousing architecture and process. . . . . . . . . . . . . 43
3.4 MAS architectures for the intelligent integration of information.
Arrows between agents depict exemplary communication flows.
Numbers denote logical time stamps of communication flows. . . . 48
3.5 A mediator architecture . . . . . . . . . . . . . . . . . . . . . . . 51
3.6 MiniCon descriptions of the query and views of Example 3.6.1. . . 58
3.7 Comparison of global-as-view and local-as-view integration. . . . . 67
3.8 Comparison of Data Integration Architectures. . . . . . . . . . . . 68

4.1 Reference Architecture . . . . . . . . . . . . . . . . . . . . . . . . 72

5.1 Hypertile of size i ≥ 2 (left) and the nine possible overlapping
hypertiles of size i − 1 (right). . . . . . . . . . . . . . . . . . . . . 91
5.2 Experiments with chain queries and nonlayered chain cind’s. . . . 95
5.3 Experiments with chain queries and two layers of chain cind’s. . . 96
5.4 Experiments with chain queries and five layers of chain cind’s. . . 96
5.5 Experiment with random queries. . . . . . . . . . . . . . . . . . . 97

6.1 Operations on schemata. . . . . . . . . . . . . . . . . . . . . . . . 100
6.2 Operations on mappings. . . . . . . . . . . . . . . . . . . . . . . . 100


6.3 Complex model management operations. . . . . . . . . . . . . . . 101
6.4 Data integration infrastructure of Example 6.2.1. Schemata are
visualized as circles and elementary mappings as arrows. . . . . . 104
6.5 The lifecycle of the mappings of a legacy integration schema. . . . 106
6.6 Merging auxiliary integration schemata to improve maintenance. . 107
6.7 A clustered auxiliary schema. Schemata are displayed as circles
and mappings as arrows. . . . . . . . . . . . . . . . . . . . . . . . 108

7.1 A cind as an inter-schema constraint (A) compared to a data trans-
formation procedure (B). Horizontal lines depict schemata and
small circles depict schema entities. Mappings are shown as thin
arrows. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7.2 EER diagram of the university domain (initial version). . . . . . . 114
7.3 EER diagram of the university domain (second version). . . . . . 118
7.4 Fixpoint of the bottom-up derivation of Example 7.2.1. . . . . . . 123
Chapter 1

Introduction

The integration of heterogeneous databases and information systems is an area
of high practical importance. The very success of information systems and data
management technology in a short period of time has caused the virtual om-
nipresence of stand-alone systems that manage data – “islands of information” –
that by now have grown too valuable not to be shared. However, this sharing, and
with it the resolution of heterogeneity between systems, entails interesting and
nontrivial problems, which have received much research interest in recent years.
Ongoing research activity, however, is evidence of the fact that many questions
remain unanswered.

1.1 A Brief History of Data Integration
Given a number of heterogeneous information systems, in practice it is not al-
ways desirable or even possible to completely reengineer and reimplement them
to create one homogeneous information system with a single schema (schema
integration [BLN86, JLVV00]). Instead, it is often necessary to perform data
integration [JLVV00], where schemata of heterogeneous information systems are
left unchanged and integration is carried out by transforming queries or data.
To realize such transformations, some flavor of mappings (either procedural code
or declarative inter-schema constraints) between information systems is required.
If the data integration reasoning is entirely effected on the level of queries and
schema-level descriptions, this is usually called query rewriting, while the term
data transformation refers to heterogeneous data themselves being classified,
transformed and fused to appear homogeneous under some integration schema.
Most previous work on data integration can be classified into two major di-
rections by the method by which inter-schema mappings used for integration
are expressed (see e.g. [FLM98, Ull97]). These are called local-as-view (LAV)
[LMSS95, YL87, LRO96, GKD97, AK92, TSI94, CKPS95] and global-as-view
(GAV) [GMPQ+ 97, ACPS96, CHS+ 95, FRV95] integration.


The more traditional paradigm is global-as-view integration, where mappings
– often called mediators after [Wie92] – are defined as follows. Mediators imple-
ment virtual entities (concepts, relations or classes, depending on nomenclature
and data model used) exported by their interfaces as views over the heteroge-
neous sources, specifying how to combine their data to resolve some (or all) of
the experienced heterogeneity. Such mediators can be (generalizations of) simple
database views (e.g. CREATE VIEW constructs in SQL) or can be implemented
by some procedural code. Global-as-view integration has been used in multi-
databases [SL90], data warehousing [JLVV00], and recently for the integration of
multimedia sources [ACPS96, CHS+ 95] and as a fertile testbed for semistructured
data models and technologies [GMPQ+ 97].
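As a simple illustration (the relation names used here are hypothetical and serve only to
fix ideas; the rule notation is the datalog notation of Section 2.1), a global-as-view
mediator may export a virtual relation part_at over two source schemata S1 and S2 by view
definitions such as

    part_at(Part, LocName) ← S1.pc_location(Part, LocId), S1.location(LocId, LocName).
    part_at(Part, LocName) ← S2.part_location(Part, LocId), S2.location(LocId, LocName).

Here the extent of the exported entity is computed from the sources, and a query over
part_at can be answered by unfolding these definitions.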
In the local-as-view paradigm, inter-schema constraints are defined in strictly
the opposite way1 . Queries over a purely logical “global” mediated schema are
answered by treating sources as if they were materialized views over the medi-
ated schema, where only these materialized views may be used to answer the
query – after all, the mediated schema does not directly represent any data.
Query answering then reduces to the so-called problem of answering queries
using views, which has been intensively studied by the database community
[LMSS95, DGL00, AD98, BLR97, RSU95] and is related to the query containment
problem [CM77, CV92, Shm87, CDL98a]. Local-as-view integration has not only
been applied to and shown to be well-suited for data integration in global infor-
mation systems [LRO96, GKD97, AK92], but also in related applications beyond
data integration, such as query optimization [CKPS95] and the maintenance of
physical data independence [TSI94].
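In the same hypothetical setting, a local-as-view description instead characterizes each
source relation by a query over the mediated schema, for instance

    S1.pc_location(Part, LocId) ← located_at(Part, LocId), of_type(Part, pc).
    S2.part_location(Part, LocId) ← located_at(Part, LocId).

where located_at and of_type are assumed to be purely logical relations of the mediated
schema; a user query posed over the mediated schema is then answered using only such source
views.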
An important distinction is to be made between data integration architectures
that are centered around a single “global” integration schema against which all
sources are integrated (this is the case, for instance, for data warehouses and
global information systems, and is intrinsic to the local-as-view approach) and
others that are not, such as federated and multidatabases. The lack of a single
global integration schema in the data integration architecture has a problematic
consequence. Each source may need to be mapped against each of the integration
schemata, leading to a large number of mappings that need to be created and
managed. In architectures such as those of federated database systems where
each component database may be a source and a consumer of integrated data at
once, a quadratic number of mappings may be required.
The globality of integration schemata is usually judged by their role in an
integration architecture. Global schemata are singletons that occupy a very cen-
tral role in the architecture, and are unique, consistent, and homogeneous world
views against which all other schemata in the system (usually considered the

¹ At first sight, this may appear unintuitive, but it is not. For instance, the local-as-view
approach can be motivated by AI planning for information gathering using content descriptions
of sources in terms of a global world model (as “planning operators”) [AK92, KW96].

[Venn diagrams: the space of tuples expressible as queries over the global schema and the
space of tuples expressible as queries over the sources (LAV), and the tuples of sources 1–3
that are accessible through mediators (GAV).]
Figure 1.1: Mappings in LAV (left) and GAV (right).

“sources”) are to be integrated. There is globality in integration schemata on a
different level as well. We want to consider integration schemata as designed at
will while taking a global perspective if

• they are artifacts specifically created for the resolution of some heterogene-
ity and

• the entirety of sources in the system that have any relevance to those het-
erogeneity problems addressed have been taken into account in the design
process.

Thus in such “global” schemata, a global perspective has been taken when
designing them. However, they do not have to be monolithic homogeneous world
views. This qualifies the collection of logical entities exported by mediators in
a global-as-view integration system as a specifically designed global integration
schema, although such a schema is not necessarily homogeneous.
An important characteristic of data integration approaches is how well concept
mismatch occurring between source and integration schemata can be bridged. We
have pointed out that both GAV and LAV use a flavor of views for the mapping
between sources and integration schemata. In Figure 1.1, we compare the local-as-
view and global-as-view paradigms by visualizing (by Venn diagrams) the spaces
of tuples (in relational queries) or objects that can be expressed by queries over
source and integration schemata.
Views as inter-schema constraints are strongly asymmetric. One single atomic
schema entity appearing in a schema on one side of the invisible conceptual border
line between integration and source schemata is always defined by a query or (as
the general idea of mediation permits) by some procedural code which computes
the entity’s extent over the schemata on the other side of that border line. As a
consequence, both LAV and GAV are restricted in how well they can deal with
concept mismatch2 .
This restriction is theoretical, because in both LAV and GAV it is always
implicitly assumed that sources are integrated against integration schemata that
have been freely designed with no other constraints imposed than the current
integration requirements3 . However, when data need to be integrated against
schemata of information systems that have design autonomy, or when integration
schemata have a legacy⁴ burden that an integration approach has to be able to
deal with, both LAV and GAV fail.
Note that views are not the only imaginable way of mapping schemata in
data integration architectures. For mappings that are not expressible as views,
it may be possible to relate the spaces of objects expressible by complex logical
expressions – say queries – over the concepts of the schemata (see Figure 1.2).
“Legacy” integration schemata are faced when

• there is no central design authority providing “global” schemata,

• future integration requirements or changes to schemata of information sys-
tems cannot be appropriately predicted,

• existing integration schemata cannot be amended when integration require-
ments or the nature of sources to be made available change in an unforeseen
way, or

• the creation of “global” schemata is infeasible because of the size and com-
plexity of the problem domain and modeling task5 [MKW00].

Recent work in the area has resulted in two new approaches that do not center
around a single “global” integration schema and where inter-schema constraints
do not necessarily have that strictly asymmetric syntax encountered in LAV and
GAV. The first uses expressive description logics systems with symmetric con-
straints for data integration [CDL98a, CDL+ 98b, Bor95]. Constraints can be
² See Example 1.3.1 and [Ull97].
³ This makes the possibility that requirements or the nature of the sources change after the design
of the integration schemata has been finished hover over such architectures like Damocles’ sword.
⁴ We do not refer to the legacy systems issue here, though. In principle, legacy systems are
operational systems that in some aspect of their design differ from what they ideally should
be like; they use at least one technology that is no longer part of the current overall strategy
in some enterprise or collaborative environment [AS99]. In practice, information systems are
usually referred to as legacy in the context of data integration if they are not even based on a
modern data management technology, usually making it necessary to treat them monolithically,
and “wrap” them [GMPQ+ 97, RS97] by software that makes them appear to respond to data
requests under a state-of-the-art data management paradigm.
⁵ This may make the Semantic Web effort of the World Wide Web Consortium [Wor01] seem
to be threatened by another very sharp blade hanging by an amazingly fragile thread.

[Venn diagram: the space of tuples expressible as queries over the sources, the space of
tuples expressible as queries over the global schema, and the space of tuples that can be
made available to queries over the integrated schema by mappings from sources.]

Figure 1.2: The space of objects that can be shared using symmetric mappings
given true concept mismatch between entities of source and integration schemata.

defined as containment relationships between complex concepts that represent
(path) queries. The main drawback is that integration has to be carried out as
ABox reasoning [CDL99], i.e. the classification of data in a (hybrid) description
logics system [Neb89]. This does not scale well to large data volumes. Further-
more, such an approach is not applicable when sources have restricted interfaces
(as is often the case on the Web) and it is not possible to import all data of a
source into the reasoning system.
The second approach, model management [BLP00, MHH+ 01], treats schemata
and mappings between schemata as first-class objects that can be stored in a
repository and manipulated with cleanly defined model management operations.
This direction is still in an early stage and no convergence towards clean, widely
usable semantics has occurred yet. Mappings are often defined as lines between
concepts (e.g. relations or classes in schemata) using an array of semantics that
are often not very expressive. While such approaches allow for neat graphical
visualization and the editing of mappings, they do not provide the mechanisms
and expressive semantics to support design and modeling actions to make evolving
schemata manageable.

1.2 The Problem
The problem addressed in this thesis is the following. We aim at an approach to
data integration that satisfies three requirements.

• Individual information systems may have design autonomy for their sche-
mata. In general, no global schemata can be built. Each individual schema
may have been defined before integration requirements were completely
known, and be ill-suited for a particular integration task.

• Individual schemata may evolve independently. Even the best-designed
integration schemata may end up with concept mismatch that cannot be
dealt with through view-based mappings.
• The third requirement concerns the scalability of the approach. The data
integration problem has to be solved entirely on the level of queries and
descriptions of information systems (i.e., query rewriting) rather than the
level of reasoning over the data to ensure the independence of the approach
from the amount of data managed.
Since the number of mappings in data integration architectures with autonomous
component systems may be quadratic in the number of schemata and thus very large,
and since schemata and integration requirements may change, a way of managing
schemata and mappings is needed that is simple and allows many tasks to be
automated. This requires support for
managing mappings and their change and reusing mappings both actively, in the
actions performed for managing schemata and mappings, and passively, through
the transitivity of their semantics6 .
The work presented in this thesis has been carried out in the context of a very
large international scientific collaboration in the area of high-energy physics. We
will have a closer look at the problem of providing interoperability of information
systems in that domain in Section 1.3.

1.3 Use Case: Large Scientific Collaborations
Large scientific collaborations are becoming more and more common because
nowadays cutting-edge scientific research in areas such as high-energy physics,
the human genome, or aerospace has become extremely expensive. Data
integration is an issue since many of the individual information systems being
operated in such an environment require integrated data to be provided from
other information systems in order to work. As we will point out in this section,
the main sources of difficulty related to source integration in the information
infrastructures of such collaborations are the design autonomy of information
systems, change of requirements and evolution of schemata, and large data sets.
A number of issues stand in the way of building a single unified “global”
logical schema (as they exist for data warehouses or global information systems)
for a large science project. We will summarize them next.
Heterogeneity. Heterogeneity is pervasive in large scientific research collabora-
tions, as there are existing legacy systems as well as largely autonomous groups
that build more such legacy systems.
⁶ That is, given that we have defined a mapping from schema A to schema B and a mapping
from schema B to schema C, we assume that we automatically arrive at a mapping from schema
A to schema C.

Scientific collaborations consist of a number7 of largely autonomous institutes
that independently develop and maintain their individual information systems8 .
This lack of central control fosters creativity and is necessary for political and
organizational reasons. However, it leads to problems when it comes to mak-
ing information systems interoperate. In such a setting, heterogeneity arises for
many reasons. Firstly, no two designers would conceptualize a given prob-
lem situation in the same way. Furthermore, distinct groups of researchers have
fundamentally different ways of dealing with bodies of knowledge, due to differ-
ent (human) languages, professional background, community or project jargon9 ,
teacher and curriculum, or “school of thought”. Several subcommunities inde-
pendently develop and use similar but distinct software for the same tasks. As a
consequence, one can assume similar but slightly different schemata10 . In an en-
vironment such as the Large Hadron Collider (LHC) project at CERN [LHC] and
huge experiments such as CMS [CMS95] currently under preparation, potentially
hundreds of individual information systems will be involved with the project dur-
ing its lifetime, some of them commercial products, others homegrown efforts of
possibly several hundred person years. This is the case because even for the same
task, sub-collaborations or individual institutes working on different subprojects
independently build systems.
When it comes to types of heterogeneity that may be encountered in such an
environment, it has to be remarked that beyond heterogeneity due to discrepan-
cies in conceptualizations of human designers (including polysemy, terminological
overlap and misalignment), there is also heterogeneity that is intrinsic to the do-
main. For example, in the environment of high-energy physics experiments (say,
a particle detector), detector parts will be necessarily conceptualized differently
depending on the kind of information system in which they are represented. For
instance, in a CAD system that is used for designing the particle detector, parts
will be spatial structures; in a construction management system, they will have
to be represented as tree-like structures modeling compositions of parts and their
sub-parts, and in simulation and experimental data taking, parts have to be
aggregated by associated sensors (readout channels), with respect to which an
experiment becomes a topological structure largely distinct from the one of the
design drawing. We believe that such differences also lead to different views on
the knowledge level, and certainly lead to different database schemata.
Hardness of Modeling. Apart from the notion of intrinsic heterogeneity introduced
in the previous paragraph, there are a number of other
issues that contribute to the hardness of modeling in a scientific domain. Firstly,
⁷ In large collaborations, they may amount to hundreds.
⁸ The requirements presented here closely relate to classifications of component autonomy in
federated databases [HM85].
⁹ Such jargon may have developed over time in previous projects on which a group of people
may have worked together.
¹⁰ Unfortunately, it is often trickier to deal with subtle than with great mismatch.

[Lifecycle stages shown include design, simulation, construction, precalibration & testing,
detector control, calibration, maintenance, event reconstruction, and decommissioning,
together with human resources and finance systems.]

Figure 1.3: Data flow between information systems that manage the steps of an
experiment’s lifecycle.

overall agreement on a conceptualization of a large real-world domain cannot be
achieved. Whenever new requirements are discovered or a better understanding
of a domain is achieved, there will be an incentive to change the current schema.
Such change may go beyond pure extension. Instead, existing parts of schemata
will have to be revisited, invalidating mappings for data integration that rely on
these schemata. Global modeling also fails because of the sheer size of such a
scientific domain. In fact, in a project that involves the collaboration of several
thousand researchers and engineers, modeling the domain would require access to
all the knowledge in the heads of all the people involved, and
for this knowledge to be stable. This, however, is an unrealistic conjecture, all
the more so in an experimental research environment.
The Project Lifecycle. It is important to note that large science projects have
a lifecycle much like industrial projects; that is, they go through stages such as
design, simulation, construction, testing, calibration, deployment, decommission-
ing, and many more11 . Such steps have some temporal overlap in practice, but
there is a gross ordering. Large science projects persist for large time spans12 . As
a consequence, the information systems for some steps of the lifecycle will not be

¹¹ See Figure 1.3 for an example of data flows that may need to occur between (heterogeneous)
information systems for the various activities in the lifecycle, all requiring data integration.
¹² For example, the LHC project is expected to be carried on for at least 15 years.


Figure 1.4: ER diagrams for Example 1.3.1: Electronics database (left) and
product-data management system (right).

built until other information systems have already been in existence for years.
In such an experimental setting, full understanding of the requirements for
subsequent information systems can often only be achieved once that information
systems for the current work have been implemented. Nevertheless, since some
information systems are already in need of data integration, one either has to
build a global logical schema today which might become invalid later, leading
to serious maintenance problems of the information infrastructure (that is, the
logical views that map sources), or an approach has to be followed that goes
without such a schema. Since it is impossible to foresee all the requirements
of a complex system far into the future, one cannot avoid the need for change
through proper a priori design.
Concept Mismatch. It is clear from the above observations that concept mis-
match between schemata relevant to data integration may occur in the domain
of high energy physics research.

Example 1.3.1 Assume there are two information systems, the first of which
is a database holding data on electronics components13 of an experiment under
construction, with the relational schema

R1 = {pc_cpu(Pc, Cpu), pc_location(Pc, LocId), location(LocId, LocName)}

The database represents information about PCs and their CPUs as well as
the location where these parts currently are to be found. Locations have a name
¹³ To make the example more easily accessible, we speak of personal computers as sole elec-
tronics parts represented. Of course, personal computers are not representative building blocks
of high-energy physics experiments.


Figure 1.5: Concept mismatch between PCs of the electronics database and parts
of the product-data management system of “Project1”.

and an identifier. The second system is a product data management system for
a subproject “Project1” with the schema

R2 = {part_of(Part1, Part2), part_location(Part, LocId),
location(LocId, LocName)}

(see also Figure 1.4). The second database schema represents an assembly
tree of “Project1” by the relation “part_of” and again the locations of parts.
Let us now assume that the first information system (the electronics database)
holds data that should be shared with the second. We assume that while the
names of the locations are the same in the second as in the first information
system, the domains of the location ids in the two information systems must be
assumed to be distinct, and cannot be shared.
We thus experience two kinds of complications with this integration problem.
The distinct key domains for locations in the two information systems in fact
entail that a correspondence has to be established between (derived) concepts in
the two schemata that are both defined by queries¹⁴. Furthermore, we observe
concept mismatch. The first schema only contains electronics parts, but may do
so for other projects besides “Project1” as well, while in the second schema only
parts of “Project1” are to be represented, but those parts are not restricted to
electronics parts (Figure 1.5).
As a third complication in this example, we assume some granularity mis-
match. Assume that the second information system is to hold a more detailed
model of “Project1” than the first and shall represent CPUs as parts of main-
boards of PCs and those in turn as parts of PCs, rather than just CPUs as parts
of PCs. Of course, we have no information on mainboards in the electronics
database, but this information could be obtained from another source.
¹⁴ Thus, this correspondence could neither be expressed in GAV nor in LAV.

We could encode this by the following semantic constraint expressing a map-
ping between schemata by a containment relationship between two queries:

{⟨Pc, Cpu, LocName⟩ |
    ∃Mb, LocId : R2.part_of(Mb, Pc) ∧ R2.part_of(Cpu, Mb) ∧
                 R2.location(LocId, LocName) ∧
                 R2.part_location(Pc, LocId)} ⊇
{⟨Pc, Cpu, LocName⟩ |
    ∃LocId : R1.pc_cpu(Pc, Cpu) ∧ R1.belongs_to(Pc, “Project1”) ∧
             R1.location(LocId, LocName) ∧ R1.pc_location(Pc, LocId)}

Informally, one may read this constraint as
PCs together with their CPUs and locations which are marked as belong-
ing to ‘Project1’ in the first information system should be part of the
answers to queries over parts and their locations in the second informa-
tion system, where CPUs should be known as parts two levels below PCs
in the assembly hierarchy represented by the part_of relation.
We do not provide any formal semantics of such constraints for data integra-
tion at this point, but rely on the intuition that such a containment constraint
between two queries expresses the desired inter-schema dependency and allows,
given appropriate reasoning algorithms (if they exist), data integration to be
performed in the presence of concept mismatch in a wide sense.

Large Data Sets. Scientific computing has always been known for manipulating
very large amounts of data. Data volumes in information systems related to the
construction of LHC experiments are expected to be in the Terabyte range, and
experimental data collected during the lifetime of LHC will amount to dozens of
Petabytes. For scalability reasons, information integration has to be carried out
on the level of queries (query rewriting) rather than data (data transformation).

1.4 Contributions of this Thesis
This thesis is, to the best of our knowledge, the first to actually address the
problem of data integration with multiple unsophisticated evolving autonomous
integration schemata. Each such schema may consist of both source relations that
hold data and logical relations that do not. Schemata may be designed without
taking other schemata or data integration considerations into account. Each
query over a schema is rewritten into a query exclusively over source relations of
information systems in the environment, using a number of schema mappings.
We propose an approach to data integration (see Figure 1.6) based on model
management and query rewriting with expressive constraints within a federated
architecture. Our flavor of query rewriting is based on constraints with clean,

[Components shown: a repository holding schemata and mappings, with an editor and schema
translation; query rewriting, physical plan generation, and query plan execution; and, on
the information systems side, mediators, proxies, and query facilities over relational
schemata and data.]
Figure 1.6: Architecture of the information infrastructure

expressive semantics. It allows for mappings between schemata that are general-
izations of both the LAV and GAV paradigms.
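For intuition, under the usual sound-view reading (and with hypothetical symbols: v a
logical relation of an integration schema, s a source relation, and Q_src, Q_med conjunctive
queries over the sources and over the mediated schema, respectively), both classical
settings can be seen as containment constraints with a single atom on one side:

    GAV:  {⟨X̄⟩ | Q_src(X̄)} ⊆ {⟨X̄⟩ | v(X̄)}
    LAV:  {⟨X̄⟩ | s(X̄)} ⊆ {⟨X̄⟩ | Q_med(X̄)}

General cind’s drop the restriction that one side be a single atom and allow conjunctive
queries on both sides.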
Regarding query rewriting, we first provide characterizations of two different
semantics for query rewriting with symmetric constraints, a classical logical and
one that is motivated by rewrite systems [DJ90]. The rewrite systems semantics
is based on the intuitions of local-as-view rewriting and generalizes from them.
We formally outline both semantics as well as algorithms for both which, given a
conjunctive query, enumerate the maximally contained rewritings15 . We discuss
various relevant aspects of query rewriting in our context, such as minimality and
nonredundancy of conjunctive queries in the rewritings. Next we compare the
two semantics and argue that the second is more intuitive and may fit better the
expectations of human users of data integration systems than the first. Following
the philosophy of that semantics, rewritings can be computed by making use of
database techniques such as query optimization and ideas from e.g. algorithms
developed for the problem of answering queries using views. We believe that in
a practical information integration context there are certain regularities (such as
sets of predicates – schemata – from which predicates are used together in queries,
while there are few queries that combine predicates from several schemata) that
render this approach more efficient in practice. Surprisingly, however, it can be
shown that the two semantics coincide. We then present a scalable algorithm for
the rewrite systems semantics (based on previous work such as [PL00]), which
we have implemented in a practical system¹⁶, CindRew. We evaluate it experi-
mentally against other algorithms for the same problem. It turns out that our
implementation, which we make available for download, scales to thousands of

¹⁵ The notion of maximally contained rewritings is the one that usually best describes the
intuitive idea of “best rewritings possible” in a data integration context.
¹⁶ This system can be checked out at http://home.cern.ch/~chkoch/cindrew/

constraints and realistic applications. We conclude with a discussion of how our
query rewriting approach fits into state-of-the-art data integration and model
management systems.
Regarding model management, we present definitions of data models, sche-
mata, mappings, and a set of expressive model management operations for the
management of schemata in a data integration setting. We argue that our ap-
proach can overcome the problems related to “unsophisticated” legacy integration
schemata, and provide a sketch of a methodology for managing evolving map-
pings.

1.5 Relevance
As we discuss a framework for data integration that is based on very weak as-
sumptions, this thesis is relevant to a large number of applications in which
other approaches eventually fail. These include networks of autonomous virtual
enterprises having different deployment lifecycles or standards for their informa-
tion systems, the information infrastructure of large international collaborations
(e.g., in science), and large enterprises that face the integration of several exist-
ing heterogeneous data warehouses after mergers or acquisitions or major change
of business model. More generally, our work is applicable in simply any envi-
ronment in which anything less than full commitment exists towards far-ranging
reengineering of information systems to bring all information systems that roam
its environment under a single common enterprise model. Obviously, our work
may also allow federated databases [HM85, SL90] to deal more successfully with
schema evolution.
Let us reconsider the point of design autonomy for schemata of information
systems in the case of companies and e-commerce. For many good reasons, com-
panies nowadays want to have their information systems interoperate; however,
there is no sufficiently strong trend towards agreeing on schemata. While there
is clearly much work done towards standardization, large players in IT have an
incentive to propose competing “standards” and bodies of meta-data. Asking
for common schemata beyond enterprise boundaries today is hardly realistic.
Instead, even the integration of the information systems inside a single large
enterprise is a problem almost too hard to solve17 , and motivates some indepen-
dence of the information infrastructure of horizontal or vertical business units,
again leading to the legacy integration schema problem that we want to address
here. That mentioned, the work in this thesis is highly relevant to business-
to-business e-commerce and the management of the extended supply chain and
¹⁷ This of course excludes the issue of data warehouses, which, although they have a global
scope w.r.t. the enterprise, address only a small part of the company data (in terms of schema
complexity, not volume) – such as sales information – that are usually well understood and
where requirements are not expected to change much in the future.

virtual enterprises.
Data warehouses that have been the results of large and very expensive de-
sign and reengineering efforts customized to a specific enterprise really are legacy
systems from the day when their design phase ends. Similarly, when companies
merge, the schemata of those data warehouses that the former entities created
are again bound to feature a substantial degree of heterogeneity. This can be ap-
proached in two ways, either by considering these schemata legacy or by creating
a new, truly global information system (almost) from scratch.

1.6 Overview
The remainder of this thesis is structured as follows. In Chapter 2, some pre-
liminary notions from database theory, computability theory, and complexity
theory are presented. Chapter 3 discusses previous work on data integration.
We start with definitions in Section 3.1 and consecutively discuss federated and
multidatabases, data warehousing, mediator systems, information integration in
AI, global-as-view and local-as-view integration (the latter is presented at some
length, since its theory will be highly relevant to our work of Chapter 5), the
description logics-based and model management approaches to data integration,
and finally, in Section 3.9, we discuss the various approaches by maintainabil-
ity and other aspects. In Chapter 4, we present our reference architecture for
data integration and discuss its building blocks, which will be treated in more
detail in subsequent chapters. Chapter 5 presents our approach to query rewrit-
ing with expressive symmetric constraints. Chapter 6 first discusses our flavor
of schemata, mappings and model management operations, and then provides
some thoughts on how to guide the modeling process for mappings such that the
integration infrastructure can be managed as easily as possible. We discuss some
advanced issues of query rewriting, notably extensions of query languages such
as recursion and sources with binding patterns in Chapter 7. We also discuss an-
other application of our work on query rewriting with symmetric constraints, the
maintenance of physical data independence under schema evolution. Chapter 8
concludes with a final discussion of the practical implications of this thesis.
Chapter 2

Preliminaries

This chapter discusses some preliminaries which mainly stem from database the-
ory and which will be needed in later chapters. It is beyond the scope of this
thesis to give a detailed account of computability theory and complexity theory.
We refer to [HU79, Sip97, GJ79, Joh90, Pap94, DEGV] for introductory texts in
these areas. We also assume a basic understanding of databases, schemata, and
query languages, and notably SQL (for an introductory work on this see [Ull88]).
Finally, we presume a basic understanding of mathematical logic and automated
theorem proving, including concepts such as resolution and refutation, and no-
tions such as predicates, atoms, terms, Skolem function, Horn clauses, and unit
clauses, which are used in the standard way (see e.g. [RN95, Pap94]).
We define the following access functions for later use: Given a Horn clause c,
Head(c) returns c’s head atom and Body(c) returns the ordered list of its body
atoms. Bodyᵢ(c) returns the i-th body atom. Pred(a) returns the predicate name
of atom a, while Preds(Body(c)) returns the predicate names of the atoms in the
body of clause c. Vars(a) returns the set of variables appearing in atom a and
Var(Body(c)) returns the variables in the body of the clause c.
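For concreteness, the following sketch (in Python; the concrete representation is purely
illustrative and not tied to any particular system) shows one possible realization of these
access functions over function-free Horn clauses:

    from dataclasses import dataclass
    from typing import List, Set, Tuple

    @dataclass(frozen=True)
    class Atom:
        pred: str                 # predicate name
        args: Tuple[str, ...]     # argument terms; variables start with an uppercase letter

    @dataclass(frozen=True)
    class Clause:
        head: Atom                # head atom
        body: Tuple[Atom, ...]    # ordered list of body atoms

    def Head(c: Clause) -> Atom:
        return c.head

    def Body(c: Clause) -> Tuple[Atom, ...]:
        return c.body

    def Body_i(c: Clause, i: int) -> Atom:
        return c.body[i - 1]      # 1-based index, as in the text

    def Pred(a: Atom) -> str:
        return a.pred

    def Preds(body: Tuple[Atom, ...]) -> List[str]:
        return [a.pred for a in body]

    def Vars(a: Atom) -> Set[str]:
        return {t for t in a.args if t[:1].isupper()}

    def Var(body: Tuple[Atom, ...]) -> Set[str]:
        return set().union(*(Vars(a) for a in body)) if body else set()

For example, for a clause c representing q(X) ← p(X, Y), r(Y, a), the call Var(Body(c))
returns {"X", "Y"}.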
We will mainly focus on the relational data model and relational queries
[Cod70, Ull88, Ull89, Kan90] under a set-based rather than bag-based seman-
tics (that is, answers to queries are sets, while they are bags in the original
relational model [Cod70] and in SQL).

2.1 Query Languages
Let dom be a countably infinite domain of atomic values. A relation schema R
is a relation name together with a sort, which is a tuple (Footnote 1) of attribute names, and
an arity, i.e.

Footnote 1: Relation schemata are usually defined as sets of attributes. However, we choose the tuple,
as we will use the unnamed calculus perspective widely throughout this work.


sort(R) = ⟨A1, . . . , An⟩        arity(R) = n

A (relational) schema R is a set of relation schemata. A relation I is a finite
set of tuples, I ⊆ dom^n. A database instance I is a set of relations.
A relational query Q is a function that maps each instance I over a schema
R and dom to another instance J over a different schema R’.
Relational queries can be seen from at least two perspectives, an algebraic
and a calculus viewpoint. Relational algebra ALG is based on the following basic
algebraic operations (see [Cod70] or [Ull88, AHV95]):

• Set-based operations (intersection ∩, union ∪, and difference \) over rela-
tions of the same sort (that is, arity, as we assume a single domain dom of
atomic values).

• Tuple-based operations (projection π, which eliminates or renames columns
(attributes) of relations, and selection σ, which filters the tuples of a relation
according to a predicate built as a conjunction of equality atoms, i.e., statements
of the form A = B, where A and B are relational attributes).

• The cartesian product × as a constructive operation that, given two rela-
tions R1 and R2 of arities n and m, respectively, produces a new relation
of arity n + m which contains a tuple ⟨t1, t2⟩ for each distinct pair of tuples
t1, t2 with t1 ∈ R1 and t2 ∈ R2.

Other operations (e.g., various kinds of joins) can be defined from these.
There are various subtleties, such as named and unnamed perspectives of ALG,
for which we refer to [AHV95].
Queries in the first-order relational domain calculus CALC are of the form

{⟨X̄⟩ | Φ(X̄)}

where X̄ is a tuple of variables (called “unbound” or “distinguished”) and Φ
is a first-order formula (using ∀, ∃, ∧, ∨, and ¬) over relational predicates pi .
An important desirable property of well-behaved database queries is domain
independence. Let the set of all atomic values appearing in a database I be
called the active domain (adom). A CALC query Q over a schema R is domain
independent iff, for any possible database I over R, Q_dom(I) = Q_adom(I).

Example 2.1.1 The CALC query {⟨x, y⟩ | p(x)} is not domain independent, as
the variable y is free to bind with any member of the domain. Clearly, such a
query does not satisfy the intuitions of well-behaved database queries. 

Unfortunately, the domain independence property is undecidable for CALC.
An alternative purely syntactic property is safety or range restriction. We refer
to [AHV95] for a treatment of safe-range calculus CALCsr , which is necessarily
somewhat lengthy. It can be shown that ALG, the domain independent relational
calculus and CALCsr are all (language) equivalent.
We refer to the class of ∀, ¬-free queries as the positive relational calculus
queries and the queries that only use ∃ and ∧ to build formulae as the conjunctive
queries. By default, conjunctive queries may contain constants but no built-in
arithmetic comparison operators.
Conjunctive queries can be written as function-free Horn clauses, called dat-
alog notation. A conjunctive query {⟨X̄⟩ | ∃Ȳ : p1(X̄1) ∧ . . . ∧ pn(X̄n)} is written
as a datalog rule

q(X̄) ← p1(X̄1), . . . , pn(X̄n).

Furthermore, conjunctive queries have to be safe. Safety in the case of con-
junctive queries is quite simple to define. A conjunctive query is safe iff each
variable in the head also appears somewhere in the atoms built from database
predicates in the body, X̄ ⊆ X̄1 ∪ . . . ∪ X̄n . Throughout this thesis, we choose
among the set-theoretic notation for conjunctive queries shown above and the
datalog notation, whichever is most convenient to support the presentation.
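
A safety check along these lines is straightforward to implement; the following minimal Python sketch assumes the same illustrative query representation as the sketch above (atoms as predicate/argument pairs, variables starting with an uppercase letter).

def is_var(t):
    return t[:1].isupper()

def is_safe(query):
    # A conjunctive query (head, body) is safe iff every head variable
    # also occurs in some body atom.
    head, body = query
    head_vars = {t for t in head[1] if is_var(t)}
    body_vars = {t for atom in body for t in atom[1] if is_var(t)}
    return head_vars <= body_vars

# q(X, Y) <- p(X) is unsafe (Y does not occur in the body); q(X) <- p(X, Y) is safe.
assert not is_safe((("q", ("X", "Y")), [("p", ("X",))]))
assert is_safe((("q", ("X",)), [("p", ("X", "Y"))]))
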
Conjunctive queries correspond to select-from-where clauses in SQL where
constraints in the where clause only use equality (=) as comparison operator.

Example 2.1.2 The subsumed query from Example 1.3.1 (a conjunctive query)
can be written as a select-from-where query in SQL

select pc, cpu, lname
from pc_cpu, belongs_to, loc, pc_loc
where pc_cpu.pc = belongs_to.pc
and pc_cpu.pc = pc_loc.pc
and pc_loc.lid = loc.lid
and belongs_to.org_entity = “Project1”;

or equivalently

q(Pc, Cpu, LName) ← pc_cpu(Pc, Cpu), belongs_to(Pc, “Project1”),
                    loc(LId, LName), pc_loc(Pc, LId)

in datalog rule notation or

π_{Pc,Cpu,LName}(pc_cpu ⋈ σ_{Org_Entity=“Project1”}(belongs_to) ⋈ pc_loc ⋈ loc)

as an ALG query. 

Queries with inequality constraints (i.e., ≠, <, ≤, also called arithmetic comparison
predicates or built-in predicates) are outside of ALG or CALC in principle,
but extensions can be defined without much difficulty (Footnote 2). A conjunctive query with
inequalities is a clause of the form

q(X̄) ← p1(X̄1), . . . , pn(X̄n), x_{i1,1} θ_1 x_{i1,2}, . . . , x_{im,1} θ_m x_{im,2}.

where the x_{ij,k} are variables in X̄1, . . . , X̄n and θ_j ∈ {≠, <, ≤}.
A datalog program is a set of datalog rules. The dependency graph of a datalog
program P is the directed graph ⟨V, E⟩ where V is the set of predicate names in
P and E contains an arc from predicate pi to predicate pj iff there is a datalog
rule in P such that pi is its head predicate and pj appears in the body of that
same rule. A datalog program is recursive iff its dependency graph is cyclic.
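
As a small illustration, the dependency graph and the recursion test can be computed as follows; this is a minimal sketch assuming the same illustrative rule representation as before (rule = (head atom, body atoms), atom = (predicate, arguments)).

def dependency_graph(program):
    # Arc from the head predicate of a rule to each predicate in its body.
    return {(head[0], atom[0]) for head, body in program for atom in body}

def is_recursive(program):
    edges = dependency_graph(program)
    succ = {}
    for a, b in edges:
        succ.setdefault(a, set()).add(b)

    def reaches(node, target, seen):
        for nxt in succ.get(node, ()):
            if nxt == target or (nxt not in seen and reaches(nxt, target, seen | {nxt})):
                return True
        return False

    nodes = {p for edge in edges for p in edge}
    return any(reaches(p, p, {p}) for p in nodes)

# Transitive closure is recursive:  tc(X,Y) <- e(X,Y).  tc(X,Y) <- e(X,Z), tc(Z,Y).
tc = [(("tc", ("X", "Y")), [("e", ("X", "Y"))]),
      (("tc", ("X", "Y")), [("e", ("X", "Z")), ("tc", ("Z", "Y"))])]
assert is_recursive(tc)
assert not is_recursive(tc[:1])
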
Positive queries (select-from-where-union queries in SQL) can be written as
nonrecursive datalog programs. Since conjunctive queries are closed under com-
position, all positive queries can also be transformed into equivalent sets of con-
junctive queries (with the head atoms over the same “query” predicate). The size
of these sets can be exponentially larger than the corresponding nonrecursive dat-
alog programs. The process of transforming a nonrecursive datalog program into
a set of conjunctive queries is analogous to translating a logical formula into
Disjunctive Normal Form (DNF) and is called query unfolding.

Example 2.1.3 The nonrecursive datalog program

q(x, y, z, w) ← a(x, y, z, w).
a(x, y, z, 1) ← b(x, y, z). a(x, y, z, 2) ← b(x, y, z).
b(x, y, 1) ← c(x, y). b(x, y, 2) ← c(x, y).
c(x, 1) ← d(x). c(x, 2) ← d(x).

with 2 ∗ 3 + 1 = 7 rules is equivalent to the following set

q(x, 1, 1, 1) ← d(x). q(x, 1, 1, 2) ← d(x).
q(x, 1, 2, 1) ← d(x). q(x, 1, 2, 2) ← d(x).
q(x, 2, 1, 1) ← d(x). q(x, 2, 1, 2) ← d(x).
q(x, 2, 2, 1) ← d(x). q(x, 2, 2, 2) ← d(x).

of 2^3 = 8 conjunctive queries. 
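
The unfolding process of Example 2.1.3 can be sketched in Python as follows; the rule representation, the variable-renaming scheme, and the simple unification routine are assumptions made for this illustration only (variables start with an uppercase letter, constants are quoted strings).

import itertools

def is_var(t):
    return t[:1].isupper()

def walk(subst, t):
    while t in subst:
        t = subst[t]
    return t

def unify(args1, args2, subst):
    # Extend subst so that the two flat argument tuples become equal, or fail.
    subst = dict(subst)
    for a, b in zip(args1, args2):
        a, b = walk(subst, a), walk(subst, b)
        if a == b:
            continue
        if is_var(a):
            subst[a] = b
        elif is_var(b):
            subst[b] = a
        else:
            return None                      # clash of two distinct constants
    return subst

def apply_subst(subst, atom):
    return (atom[0], tuple(walk(subst, t) for t in atom[1]))

fresh = itertools.count()

def rename(rule):
    # Fresh variable names for each rule application.
    n = next(fresh)
    ren = lambda a: (a[0], tuple(f"{t}_{n}" if is_var(t) else t for t in a[1]))
    return ren(rule[0]), [ren(a) for a in rule[1]]

def unfold(rules, goals, idb):
    # Yields goal lists in which every idb atom has been replaced via some rule.
    pending = [(i, a) for i, a in enumerate(goals) if a[0] in idb]
    if not pending:
        yield list(goals)
        return
    i, atom = pending[0]
    for rule in rules:
        head, body = rename(rule)
        if head[0] != atom[0]:
            continue
        subst = unify(head[1], atom[1], {})
        if subst is not None:
            new_goals = [apply_subst(subst, g) for g in goals[:i]] + \
                        [apply_subst(subst, b) for b in body] + \
                        [apply_subst(subst, g) for g in goals[i + 1:]]
            yield from unfold(rules, new_goals, idb)

# The program of Example 2.1.3 (constants written as "1", "2"):
rules = [(("a", ("X", "Y", "Z", c)), [("b", ("X", "Y", "Z"))]) for c in ("1", "2")] + \
        [(("b", ("X", "Y", c)), [("c", ("X", "Y"))]) for c in ("1", "2")] + \
        [(("c", ("X", c)), [("d", ("X",))]) for c in ("1", "2")]
query = (("q", ("X", "Y", "Z", "W")), [("a", ("X", "Y", "Z", "W"))])
unfolded = list(unfold(rules, [query[0]] + query[1], {"a", "b", "c"}))
assert len(unfolded) == 8       # the 2^3 conjunctive queries of Example 2.1.3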

Relational algebra and calculus are far from representing all computable que-
ries over relational databases. For example, not even the transitive closure of
Footnote 2: There are, however, a few subtle issues, such as the question whether the domain is totally
ordered and its impact on data independence [CH80, CH82], that are important for the theory of
queries. Since we only touch on queries with inequalities briefly, we leave this aside.

binary relations can be expressed using the first-order queries3 . Much has been
said on categories and hierarchies of relational query languages, and examples of
languages strictly more expressive than relational algebra and calculus are, for
instance, datalog with negation (under various semantics) or the while queries.
We refer to [CH82, Cha88, Kan90, AHV95] for more on these issues.
Treatments of complexity and expressiveness of relational query languages can
be found in [Var82, CH82, Cha88, AHV95]. We leave these issues to the related
literature and remark only that evaluating positive relational calculus queries, given
as nonrecursive datalog programs, is complete in PSPACE in terms of combined
complexity [Var82]. The decision problem whether an unfolding of a
conjunctive query with a nonrecursive datalog program (with constants) exists
that uses only certain relational predicates – which is related to the approach to
data integration developed later on in this thesis – is equally PSPACE-complete
and thus presumably a computationally hard problem.

2.2 Query Containment
The problem of deciding whether a query Q1 is contained in a query Q2 (denoted
Q1 ⊆ Q2), possibly under a number of constraints describing a schema, is that
of deciding whether, for every possible database satisfying the constraints, each
tuple in the result of Q1 is also contained in the result of Q2. Two queries are called
equivalent, denoted Q1 ≡ Q2 , iff Q1 ⊆ Q2 and Q1 ⊇ Q2 .
The containment problem quickly becomes undecidable for expressive query
languages. Already for relational algebra and calculus, the problem is undecid-
able [SY80, Kan90]. In fact, the problem is co-r.e. but not recursive (under the
assumption that databases are finite but the domain is not): noncontainment of two
queries can be established by enumerating the finite databases over dom until a
counterexample is found, but no effective procedure can verify containment itself.
For conjunctive queries, the containment problem is decidable and NP-com-
plete [CM77]. Since queries tend to be small, query containment can be prac-
tically used, for instance in query optimization or data integration [CKPS95,
YL87]. It is usually formalized using the notion of containment mappings (ho-
momorphisms) [CM77].

Definition 2.2.1 Let Q1 and Q2 be two conjunctive queries. A containment
mapping θ is a function from the variables and constants of Q1 into the variables
and constants of Q2 that is

• the identity on the constants of Q1

• Head_i(Q2) for the variable Head_i(Q1), i.e., θ(Head_i(Q1)) = Head_i(Q2) for each head position i
Footnote 3: However, transitive closure can of course be expressed in datalog.

• and for which, for every atom p(x1, . . . , xn) ∈ Body(Q1),

p(θ(x1), . . . , θ(xn)) ∈ Body(Q2).


It can be shown that for two conjunctive queries Q1 and Q2 , the containment
Q1 ⊆ Q2 holds iff there is a containment mapping from Q2 into Q1 [CM77].

Example 2.2.2 [AHV95] The two conjunctive queries

q1(x, y, z) ← p(x2, y1, z), p(x, y1, z1), p(x1, y, z1),
              p(x, y2, z2), p(x2, y2, z).

and

q2(x, y, z) ← p(x2, y1, z), p(x, y1, z1), p(x1, y, z1).

are equivalent. For q1 ⊆ q2, the containment mapping is the identity. Clearly,
since Body(q2) ⊂ Body(q1), and the heads of the two queries match, q1 ⊆ q2 must
hold. For the other direction, we have θ(x) = x, θ(y) = y, θ(z) = z, θ(x1) = x1,
θ(y1) = y1, θ(z1) = z1, θ(x2) = x2, θ(y2) = y1, and θ(z2) = z1. 

An alternative way [Ull97] of deciding whether a conjunctive query Q1 is
contained in a second, Q2 , is to freeze the variables of Q1 into new constants (i.e.,
which do not appear in the two queries) and to evaluate Q2 on the canonical
database created from the frozen body atoms of Q1 . Q1 is then contained in Q2
if and only if the frozen head of Q1 appears in the result of Q2 over the canonical
database.

Example 2.2.3 Consider again the two queries of Example 2.2.2. The canonical
database for q2 is I = {p(a_x2, a_y1, a_z), p(a_x, a_y1, a_z1), p(a_x1, a_y, a_z1)} where
a_x, a_y, a_z, a_x1, a_y1, a_z1, a_x2 are constants. We have

q1(I) = {⟨a_x2, a_y1, a_z⟩, ⟨a_x2, a_y1, a_z1⟩, ⟨a_x, a_y1, a_z⟩, ⟨a_x, a_y1, a_z1⟩,
         ⟨a_x, a_y, a_z⟩, ⟨a_x, a_y, a_z1⟩, ⟨a_x1, a_y1, a_z1⟩, ⟨a_x1, a_y, a_z1⟩}

Since the frozen head of q2 is ⟨a_x, a_y, a_z⟩ and ⟨a_x, a_y, a_z⟩ ∈ q1(I), q2 is contained
in q1. 
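
The freezing technique of Example 2.2.3 lends itself to a direct, if naive, implementation; the following Python sketch (query representation and variable convention as in the earlier sketches, with a brute-force and therefore exponential evaluation step) is for illustration only.

from itertools import product

def is_var(t):
    return t[:1].isupper()

def contained_in(query1, query2):
    # Test query1 ⊆ query2: freeze query1, build its canonical database,
    # evaluate query2 over it and look for the frozen head of query1.
    (head1, body1), (head2, body2) = query1, query2
    freeze = lambda t: "frozen_" + t if is_var(t) else t
    canonical_db = {(p, tuple(freeze(t) for t in args)) for p, args in body1}
    frozen_head = tuple(freeze(t) for t in head1[1])
    adom = {c for _, args in canonical_db for c in args}
    variables = sorted({t for _, args in body2 for t in args if is_var(t)})
    for values in product(adom, repeat=len(variables)):       # naive evaluation
        theta = dict(zip(variables, values))
        inst = lambda t: theta.get(t, t)
        if all((p, tuple(inst(t) for t in args)) in canonical_db for p, args in body2) \
                and tuple(inst(t) for t in head2[1]) == frozen_head:
            return True
    return False

# q'(X) <- p(X, X) is contained in q(X) <- p(X, Y), p(Y, X), but not vice versa.
q  = (("q", ("X",)), [("p", ("X", "Y")), ("p", ("Y", "X"))])
qp = (("q", ("X",)), [("p", ("X", "X"))])
assert contained_in(qp, q) and not contained_in(q, qp)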

The containment of positive queries Q1 , Q2 can be checked by transforming
them into sets of conjunctive queries Q′1 , Q′2 . Q′1 is of course contained in Q′2 iff
each member query of Q′1 is individually contained in a member query of Q′2 .

Bibliographic Notes
The containment problem for conjunctive queries is NP-complete, as mentioned.
The problem can be efficiently solved for two queries if neither query contains
more than two atoms of the same relational predicate [Sar91]. In that case, a
very efficient algorithm exists that runs in time linear in the size of the queries.
Another polynomial-complexity case is encountered when the so-called hypergraph
of the query to be tested for subsumption is acyclic [YO79, FMU82, AHV95]. For
that class of queries, the technique of Example 2.2.3 can be combined with the
polynomial expression complexity of the candidate subsumer query.
If arithmetic comparison predicates4 are permitted in conjunctive queries
[Klu88], the complexity of checking query containment is harder and jumps to the
second level of the polynomial hierarchy [vdM92]. The containment of datalog
queries is undecidable [Shm87]. This remains true even for some very restricted
classes of single-rule programs (sirups) [Kan90]. Containment of a conjunctive
query in a datalog query is EXPTIME-complete – this problem can be solved with
the method of Example 2.2.3, but then consumes the full expression complexity
of datalog [Var82] (i.e., EXPTIME). The opposite direction, i.e. containment of
a datalog program in a conjunctive query, is still decidable but highly intractable
(it is 2-EXPTIME-complete [CV92, CV94, CV97]).
Other interesting recent work has been on the containment of so-called regular
path queries – which have found much research interest in the field of semistruc-
tured databases – under constraints [CDL98a] and on containment of a class of
queries over databases with complex objects [LS97] (see also Section 2.5).

2.3 Dependencies
Dependencies are used in database design to add semantics and integrity constraints
to a schema, with which database instances have to comply. Two particularly
important classes of dependencies are functional dependencies (abbreviated
fd’s) and inclusion dependencies (ind’s).
A functional dependency R : X → Y over a relational predicate R (where X
and Y are sets of attribute names of R5 ) has the following semantics. It enforces
that for each relation instance over R, for each pair t1 , t2 of tuples in the instance,
if for each attribute name in X the values in t1 and t2 are pairwise equal, then
the values for the attributes in Y must be equal as well.
Primary keys are special cases of functional dependencies where X∪Y contains
all attributes of R.
Footnote 4: Such queries satisfy the real-world need of asking queries where an attribute is required, for
instance, to have a value greater than a certain constant.
Footnote 5: Under the unnamed perspective, which is sufficient for conjunctive queries in datalog notation,
we will refer to the i-th attribute position in R by $i instead of an attribute name.

Example 2.3.1 Let R ⊆ dom^3 be a ternary relation with two functional dependencies
R : $1 → $2 $3 (i.e., the first attribute is a primary key for R) and
R : $3 → $2. Consider an instance I = {⟨1, 2, 3⟩}. The attempt to insert a new
tuple ⟨1, 2, 4⟩ into R would violate the first fd, while the attempt to do the same
for ⟨5, 6, 3⟩ would violate the second. 
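
A check of whether a relation instance satisfies a given functional dependency is easy to sketch; in the following illustrative Python fragment, attribute positions are given as 0-based indices (the text writes $1, $2, . . . for 1-based positions), and the tuples are those of Example 2.3.1.

def satisfies_fd(instance, lhs, rhs):
    # instance: a set of tuples; lhs, rhs: lists of attribute positions.
    seen = {}
    for tup in instance:
        key = tuple(tup[i] for i in lhs)
        val = tuple(tup[i] for i in rhs)
        if key in seen and seen[key] != val:
            return False
        seen[key] = val
    return True

instance = {(1, 2, 3)}
assert satisfies_fd(instance, [0], [1, 2]) and satisfies_fd(instance, [2], [1])
assert not satisfies_fd(instance | {(1, 2, 4)}, [0], [1, 2])    # violates $1 -> $2 $3
assert not satisfies_fd(instance | {(5, 6, 3)}, [2], [1])       # violates $3 -> $2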

Informally, inclusion dependencies are containment relationships between que-
ries of the form πγ (R), i.e., attributes of a single relation R may be reordered or
projected out. Foreign key constraints, which require that a foreign key stored in
one tuple must also exist in the key attribute position of some tuple of a usually
different relation, are inclusion dependencies.
Notably, dependencies, as database semantics, are valuable in query optimization
and make it possible to enforce the integrity of database updates.

2.4 Global Query Optimization
Modern database systems rely on the idea of a separation of physical and logical
schemata in order to simplify their use [TK78, AHV95]. This, together with the
declarative flavor of many query languages, leads to the need to optimize queries
such that they execute quickly.
In the general case of the relational queries (i.e., ALG or the relational calcu-
lus), global optimization is not computable. For conjunctive queries, and on the
logical level, where physical cost-based metrics can be left out of consideration,
though, global optimality (that is, minimality) can be achieved. A conjunctive
query Q is minimal if there is no equivalent conjunctive query Q′ s.t. Q′ has fewer
atoms (subgoals) in its body than Q.
This notion of optimality is justified because joins of relations are usually
among the most expensive relational (algebra) operations carried out by a rela-
tional database system during query execution. Minimality is of interest in data
integration as well.
Computing a minimal equivalent conjunctive query is strongly related to the
query containment problem (see Section 2.2). The associated decision problem
is again NP-complete. Minimal queries can be computed using the following fact
[CM77]. Given a conjunctive query Q, there is a minimal query Q′ (with Q ≡ Q′ )
s.t. Head(Q) = Head(Q′ ) and Body(Q′ ) ⊆ Body(Q), i.e. the heads are equal
and the body of Q′ contains a subset of the subgoals of Q, without any changes
to variables or constants. Conjunctive queries can thus be optimized by checking
all queries created by dropping body atoms from Q while preserving equivalence
and searching for the smallest such query.

Example 2.4.1 Take the queries q1 and q2 from Example 2.2.2. By checking all
subsets of Body(q2), it can be seen that q2 is already minimal. In fact, q2 is also

a minimal query for q1 , as Body(q2 ) is the smallest subset of Body(q1) such that
q2 and q1 remain equivalent. 
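
The optimization procedure just described can be sketched as follows; the containment test is implemented here via a backtracking search for a containment mapping (Definition 2.2.1), and the query representation is again the illustrative one used in the earlier sketches.

def is_var(t):
    return t[:1].isupper()

def hom_exists(q_from, q_to):
    # Is there a containment mapping from q_from into q_to?  If so, q_to ⊆ q_from.
    (head_f, body_f), (head_t, body_t) = q_from, q_to

    def extend(theta, args_f, args_t):
        theta = dict(theta)
        for a, b in zip(args_f, args_t):
            if is_var(a):
                if theta.setdefault(a, b) != b:
                    return None
            elif a != b:                      # constants must map to themselves
                return None
        return theta

    def search(i, theta):
        if i == len(body_f):
            return True
        p, args_f = body_f[i]
        for p2, args_t in body_t:
            if p2 == p:
                theta2 = extend(theta, args_f, args_t)
                if theta2 is not None and search(i + 1, theta2):
                    return True
        return False

    theta0 = extend({}, head_f[1], head_t[1])
    return theta0 is not None and search(0, theta0)

def minimize(query):
    head, body = query
    body = list(body)
    changed = True
    while changed:
        changed = False
        for i in range(len(body)):
            smaller = (head, body[:i] + body[i + 1:])
            # smaller always contains the original query; it is equivalent iff it is
            # also contained in it, i.e., iff a containment mapping from the original
            # query into smaller exists.
            if hom_exists(query, smaller):
                body = smaller[1]
                changed = True
                break
    return head, body

# Minimizing q1 of Example 2.2.2 yields the three atoms of q2 (cf. Example 2.4.1).
q1 = (("q", ("X", "Y", "Z")),
      [("p", ("X2", "Y1", "Z")), ("p", ("X", "Y1", "Z1")), ("p", ("X1", "Y", "Z1")),
       ("p", ("X", "Y2", "Z2")), ("p", ("X2", "Y2", "Z"))])
assert len(minimize(q1)[1]) == 3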

Global optimization of conjunctive queries under a number of dependencies
(e.g., fd’s) can be carried out using a folklore technique called the chase [ABU79,
MMS79], for which we refer to the literature (see also [AHV95]).

2.5 Complex Values and Object Identities
Among the principal additional features of the object-oriented data model [BM93,
Kim95, CBB+ 97], compared to the relational model, we have object identifiers,
objects that have complex (“nested”) values, IS-A hierarchies, and behavior at-
tributed to classes of objects, usually via (mostly) imperative methods. For the
purpose of querying and data integration under the object-oriented data model,
the notions of object identifiers and complex objects deserve some consideration.
Research on complex values in database theory has started by giving up the
requirement that values in relations may only contain atomic values of the domain
(non-first normal form databases). The complex value model, theoretically very
elegant, is strictly a generalization of the relational data model. Values are created
inductively from set and tuple constructors. The relational data model is thus
the special case of the complex value model where each relation is a set of tuples
over the domain. For instance,

{⟨A : dom, B : dom, C : {⟨A : dom, B : {dom}⟩}⟩}

is a valid sort in the complex value model and

{⟨a, b, {⟨c, {}⟩, ⟨d, {e, g}⟩}⟩, ⟨e, f, {}⟩}

is a value of this sort, where a, b, c, d, e, f , g are constants of dom. As
for the relational data model, algebra and calculus-based query languages can
be specified, and equivalences be established. Informally, in the algebraic per-
spective, set-based operations (union, intersection and difference), which are re-
quired to operate over sets of the same sorts, and simple tuple-based operations
(such as projection) known from the relational model are extended by a more ex-
pressive selection operation, which may have conditions such as set membership
and equality of complex values, and the powerset operation, furthermore tuple-
and set-creation and destruction operations (see [AHV95]). Other operations
such as renaming, join, and nesting and unnesting can be defined from these.
The complex-value algebra (ALGcv ) has hyperexponential complexity. When the
powerset operation is replaced by nesting and unnesting operations, we arrive
at the so-called nested relation algebra ALGcv− . All queries in ALGcv− can be

executed efficiently (relative to the size of the data), which has motivated com-
mercial object-oriented database systems such as O2 [LRV88] and standards such
as ODMG’s OQL [CBB+ 97] to closely adopt it.
Interestingly, it can be shown that all ALGcv− queries over relational databases
have equivalent relational queries [AB88, AHV95]. This is due to the fact that
unnested values in a tuple always represent keys for the nested tuples; nestings
are thus purely cosmetic.
Furthermore, every complex value database can be transformed (in polyno-
mial time relative to the size of the complex value database) into a relational
one [AHV95] (This, however, requires keys that identify nested tuples as objects,
i.e., object identifiers). The nested relation model - and with it a large class
of object-oriented queries - is thus just “syntactic sugaring” over the relational
data model with keys as supplements for object identifiers. From the query-only
standpoint of data integration, where structural integration can take care of in-
venting object identifiers in the canonical transformation between data models,
we can thus develop techniques in terms of relational queries, which can then be
straightforwardly applied to object-oriented databases as well6 .
We also make a comment on the calculus perspective. In contrast to the
relational model, in the complex value calculus CALCcv variables may represent
and be quantified over complex values. We are thus operating in a higher-order
predicate calculus with a finite model semantics. The generalization of range
restriction (called safe-range calculus) for the relational calculus to the complex
value calculus is straightforward but verbose (see [AHV95]). It can be shown
that ALGcv and the safe-range calculus CALCcv (which represents exactly the
domain independent complex value calculus queries) are equivalent. Furthermore,
if set inclusion is disallowed but set membership as the analog of nesting remains
permitted, the so-called strongly safe-range calculus CALCcv− is attained, which
is equivalent to ALGcv− .
Conjunctive nested relation algebra – in which set union and difference have
been removed from ALGcv− – is thus equivalent to the conjunctive relational
queries.

Example 2.5.1 Consider an instance Parts, which is a set of complex values of
the following sort. A part (in a product-data management system) is a tuple of a
barcode B, a name N, and a set of characteristics C. A characteristic is a tuple
of a name N and a set of data elements D. A data element is a tuple of a name
N, a unit of measurement U, and a value V 7 . The sort can be thus written as

⟨B : dom, N : dom, C : {⟨N : dom, D : {⟨N : dom, U : dom, V : dom⟩}⟩}⟩

Footnote 6: Some support for object-oriented databases is a requirement in the use case of Section 1.3.
Footnote 7: For simplicity, we assume that all atomic values are of the same domain dom. This is not
an actual restriction unless arithmetic comparison operators (<, ≤) are allowed in the query
language.

Suppose now that we ask the following query in nested relation algebra ALGcv− :

π_{N,B,D}(unnest_C(π_{B,C}(Parts)))

which asks for transformed complex values of sort

⟨N : dom, B : dom, D : {⟨N : dom, U : dom, V : dom⟩}⟩

and can be formulated in strongly safe-range calculus CALCcv− as

{x : ⟨N, B, D : {⟨N, U, V⟩}⟩ | ∃y, z, z′, w, w′, u, u′ :
    y : ⟨B, N, C : {⟨N, D : {⟨N, U, V⟩}⟩}⟩ ∧
    z : {⟨N, D, {⟨N, U, V⟩}⟩} ∧
    z′ : ⟨N, D, {⟨N, U, V⟩}⟩ ∧
    w : {⟨N, U, V⟩} ∧ w′ : ⟨N, U, V⟩ ∧
    u : {⟨N, U, V⟩} ∧ u′ : ⟨N, U, V⟩ ∧
    x.B = y.B ∧ y.C = z ∧ z′ ∈ z ∧
    z′.N = x.N ∧ z′.D = w ∧ w′ ∈ w ∧
    x.D = u ∧ u′ ∈ u ∧ u′ = w′}

Let us map the collection Parts to a flat relational database with schema

R = {Part(Poid, B, N), Char(Coid, N, Poid), DataElement(N, U, V, Coid)}

where the attributes Poid and Coid stand for object identifiers which must be
invented when flattening the data. The above query can now be equivalently
asked in relational algebra as

π_{N,B,Dn,U,V}((π_{Poid,B}(Part) ⋈ Char) ⋈ π_{N→Dn,U,V,Coid}(DataElement))

The greatest challenge here is the elimination or renaming of the three name
attributes N. The same query has the following equivalent in the (conjunctive)
relational calculus

{⟨x, y, z, u, v⟩ | ∃i1, i2, d : Part(i1, x, d) ∧ Char(i2, y, i1) ∧ DataElement(z, u, v, i2)}

After executing the query, the results can be nested to get the correct result
for the nested relational algebra or calculus query. 
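
To make the flattening step concrete, the following minimal Python sketch invents object identifiers while flattening a nested Parts collection (represented here, as an assumption for illustration, by dictionaries and lists) into the flat schema R of Example 2.5.1.

import itertools

def flatten(parts):
    # Produces the relations Part(Poid, B, N), Char(Coid, N, Poid),
    # DataElement(N, U, V, Coid), inventing the object identifiers Poid and Coid.
    oid = itertools.count(1)
    part_rel, char_rel, data_rel = [], [], []
    for part in parts:
        poid = f"p{next(oid)}"
        part_rel.append((poid, part["B"], part["N"]))
        for char in part["C"]:
            coid = f"c{next(oid)}"
            char_rel.append((coid, char["N"], poid))
            for d in char["D"]:
                data_rel.append((d["N"], d["U"], d["V"], coid))
    return part_rel, char_rel, data_rel

# A single part with one characteristic and two data elements (made-up values):
parts = [{"B": "4711", "N": "bolt",
          "C": [{"N": "geometry",
                 "D": [{"N": "length", "U": "mm", "V": "20"},
                       {"N": "diameter", "U": "mm", "V": "4"}]}]}]
part_rel, char_rel, data_rel = flatten(parts)
assert char_rel[0][2] == part_rel[0][0]      # the invented Poid links Char to Part

The relational query of the example can then be executed as a join over the invented identifiers, and its result re-nested if required.
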
Chapter 3

Data Integration

This chapter briefly surveys several research areas related to data integration.
We proceed by first presenting two established architectures, federated and mul-
tidatabases in Section 3.2 and data warehouses in Section 3.3. Next, in Sec-
tion 3.4, we discuss information integration in AI. Several research areas of AI
that are relevant to this thesis are surveyed, including ontology-based global in-
formation systems, capability description and planning, and multi-agent systems
as a further integration architecture. Then we discuss global-as-view integra-
tion (together with an integration architecture, mediator systems) in Section 3.5
and local-as-view integration in Section 3.6. In Sections 3.7 and 3.8 we arrive
at recent data integration approaches. Section 3.9 discusses management and
maintainability issues in large and evolving data integration systems and com-
pares the different approaches presented according to various qualitative aspects.
First, however, we start with some definitions.

3.1 Definitions and Overview
Source integration [JLVV00] refers to the process of integrating a number of
sources (e.g. databases) into one greater common entity. The term is usually
used as part of a greater, more encompassing process, as perceived in the data
warehousing setting, where source integration is usually followed by aggregation
and online analytical processing (OLAP). There are two forms of source inte-
gration, schema integration and data integration. Schema integration [BLN86]
refers to a software engineering or knowledge engineering approach, the process
of reverse-engineering information systems and reengineering schemata in order
to obtain a single common “integrated” schema – which we will not address in
more detail in this thesis. While the terms data and information are of course
not to be confused, data integration and information integration are normally
used synonymously (e.g., [Wie96, Wie92]).
Data integration is the area of research that addresses problems related to


Figure 3.1: Artist’s impression of source integration.

the provision of interoperability to information systems by the resolution of het-
erogeneity between systems on the level of data. This distinguishes the problem
from the wider aim of cooperative information systems [Coo], where also more
advanced concepts such as workflows, business processes, and supply chains come
into play, and where problems related to coordination and collaboration of sub-
systems are studied which go beyond the techniques required and justified for the
integration of data alone.
The data integration problem can be decomposed into several subproblems.
Structural integration (e.g., wrapping [GK94, RS97]) is concerned with the res-
olution of structural heterogeneity, i.e. the heterogeneity of data models, query
and data access languages, and protocols1 . This problem is particularly inter-
esting when it comes to legacy systems, which are systems that in general have
some aspect that would be changed in an ideal world but in practice cannot be
[AS99]. In practice, this often refers to out-of-date systems in which parts of the
code base or subsystems cannot be adapted to new requirements and technologies
because they are no longer understood by the current maintainers or because the
source code has been lost.
Semantic integration refers to the resolution of semantic mismatch between
schemata. Mismatch of concepts appearing in such schemata may be due to a
number of reasons (see e.g. [GMPQ+ 97]), and may be a consequence of differ-
ences in conceptualizations in the minds of different knowledge engineers. Mis-
Footnote 1: We experience structural heterogeneity if we need to make a number of databases interoperable
of which, for example, some are relational and others object-oriented, or if among the
relational databases some are only queryable using SQL while others are only queryable using
QUEL [SHWK76]. Other kinds of structural heterogeneity are encountered when two database
systems use different models for managing transactions, or when they lack middleware, compatible
with both, that allows them to exchange queries and results.

match may not only occur on the level of schema entities (relations in a relational
database or classes in an object-oriented system), but also on the level of data.
The associated problem, called data reconciliation [JLVV00], includes object iden-
tification (i.e., the problem of determining correspondences of objects represented
by different heterogeneous data sources) and the handling of mistakes that hap-
pened during the acquisition of data (e.g. typos), which is usually referred to as
data cleaning. An overview of this classification of source integration is given in
Figure 3.1.
Since for this thesis, the main problem among those discussed in this section
is the resolution of semantic mismatch, we will also put an emphasis on this
problem in the following discussion and comparison of research related to data
integration.

3.2 Federated and Multidatabases
The data integration problem has been addressed early on by work on multi-
database systems. Multidatabase systems are collections of several (distributed)
databases that may be heterogeneous and need to share and exchange data. Ac-
cording to the classification2 of [SL90], federated database systems [HM85] are a
subclass of multidatabase systems. Federated databases are collections of col-
laborating but autonomous component database systems. Nonfederated multi-
database systems, on the other hand, may have several heterogeneous schemata
but lack any other kind of autonomy. Nonfederated multidatabase systems have
one level of management only and all data management operations are performed
uniformly for all component databases. Federated database systems can be cat-
egorized as loosely or tightly coupled systems. Tightly coupled systems are ad-
ministrated as one common entity, while in loosely coupled systems, this is not
the case and component databases are administered independently [SL90].
Component databases of a federated system may be autonomous in several
senses. Design autonomy permits the creators of component databases to make
their own design choices with respect to representation, i.e. data models and
query languages, data managed and schemata used for managing them, and the
conceptualizations and semantic interpretations of the data applied. Other kinds
of component autonomy that are of less interest to this thesis but still deserve to
be mentioned are communication autonomy, execution autonomy and association
autonomy [SL90, HM85]. Autonomy is often in conflict with the need for sharing
data within a federated database system. Thus, one or several kinds of autonomy
may have to be relaxed in practice to be able to provide interoperability.
Footnote 2: There is some heterogeneity in the nomenclature of this area. A cautionary note is due at
this point: Many of the terms in this chapter have been used heterogeneously by the research
community. Certain choices had to be made in this thesis to allow a uniform presentation,
which are hopefully well documented.

Figure 3.2: Federated 5-layer schema architecture

Modern database systems successfully use a three-tier architecture [TK78]
which separates physical (also called internal) from logical representation and
the logical schema in turn from possibly multiple user or application perspectives
(provided by views). In federated database systems, these three layers are con-
sidered insufficient, and a five-layer schema architecture has been proposed (e.g.
[SL90] and Figure 3.2). Under this architecture, there are five types of schemata
between which queries are translated. These five types of schemata are

• Local schemata. The local schema of a component database corresponds to
the logical schema in the classical three-layered architecture of centralized
database systems.

• Component schemata. The component schema of a database is a version of
its local schema translated into the data model and representation formal-
ism shared across the federated database system.

• Export schemata. An export schema contains only the part of a component
schema that is relevant to one integrated federated schema.

• Federated schemata (Footnote 3). This schema is an integrated homogeneous view of
the federation, against which a number of export schemata are mapped
(using data integration technology). There may be several such federated
schemata inside a federation, providing different integrated views of the
available data.
Footnote 3: These are also known as import schemata or global schemata [SL90].
Figure 3.3: Data warehousing architecture and process.

• External Schemata provide application or user-specific views of the feder-
ated schemata, as in the classical three-layer architecture.

This five-layer architecture is believed to provide better support for the inte-
gration and management of heterogeneous autonomous databases than the clas-
sical three-layer architecture [HM85, SL90].

3.3 Data Warehousing
Data Warehousing (Figure 3.3) is a somewhat interdisciplinary area of research
whose scope goes beyond pure data integration. The goal is usually, in an en-
terprise environment, to collect data from a number of distributed sites4 (e.g.,
grocery stores), clean and integrate them, and put them into one large central
store, the corporate data warehouse. Data warehousing is also about performing
aggregation of relevant data (e.g. sales data). Data may then be extracted and
transformed according to schemata customized for particular users or analysis
tools (Online Analytical Processing, OLAP) [JLVV00].
Since the data manipulated are in practice often highly mission-critical to
enterprises and may be very large, special technologies have been developed for
Footnote 4: The point of this is not just the resolution of heterogeneity but also to have distinct systems
for Online Transaction Processing (OLTP) and data analysis for decision support, which
usually access data in very different ways and also need differently optimized schemata. (In
OLTP, transactions are usually short and occur at a high density, while in OLAP, transactions
are few but long and put emphasis on querying.)

dealing with aggregation of data (e.g. the summarization of sales data according
to criteria such as categories of products sold, regions, and time spans), such as
multidimensional databases (MDDBMS) or data cubes.
As data integrated against a warehouse are usually materialized there, the
data warehousing literature often makes a distinction between mediation, which is
confined to data integration on demand, i.e. when a query against the warehouse
occurs (also called “virtual” integration or the lazy approach [Wid96] by data
warehouse researchers), and materialized data integration (the eager approach
[Wid96]). The materialized approach to data integration in fact adds problems
related to dynamic aspects (e.g., the view update and view maintenance prob-
lems). These problems are not yet well understood, and known theoretical results
are often quite negative [AHV95].
Data Warehousing has received considerable interest in industry, and there
are several commercial implementations, such as those by Informix and MicroS-
trategy [JLVV00]. Two well-known research systems are WHIPS [GMLY98] and
SQUIRREL [ZHKF95b, ZHKF95a].

3.4 Information Integration in AI
There has traditionally been much cross-fertilization between the artificial intel-
ligence and information systems areas, and the intelligent integration of infor-
mation [Wie96] is not an exception. It is particularly worthwhile to take note of
research on ontologies, capability description, planning, knowledge-based systems,
and multi-agent systems. Another important area is description logics, which
we leave to its own section (Section 3.7). Work in these areas has – sometimes
indirectly – had much influence on data integration.

3.4.1 Integration against Ontologies
There is an ongoing discussion among Formal Ontologists and AI researchers
on how to define ontologies [GN87, Gru, GG95, Gua94, HS97]. One definition
that has been particularly well argued for refers to ontologies as partial accounts
of specifications of conceptualizations [GG95]. Ontologies are logical theories of
parts of conceptualizations (to be found in the mind of some knowledge engineer)
of a problem domain. As such, ontologies may consist of more than taxonomi-
cal knowledge but include virtually any kind of knowledge. In practice, in the
context of information integration, we are interested in ontologies as the information
models of AI information systems, i.e., as powerful forms of schemata.
Ontological engineering [Gua97, Gru92, Gru93a, Gru93b, CTP00] concerns it-
self with the design and maintenance of large ontologies. Several research projects
on tools [DSW+ 99] for ontological engineering, such as the Ontolingua server
[FFR96], have been carried out. One problem also addressed is the one of reengi-

neering and merging existing ontologies, which is in many ways similar to schema
integration [BLN86]. Experiences show much similarity with developments in
object-oriented software engineering and information systems research. Design-
ing and maintaining large ontologies has been found to lead to problems (see the
Cyc experience [LGP+ 90]), and research has followed approaches such as apply-
ing the idea of design patterns [GHJV94] to ontological engineering [CTP00], or
the use of libraries of micro-ontologies, which are small building blocks that can
be composed to create domain ontologies on demand.
AI data integration systems are usually based on an architecture in which
there is one well-designed “global” domain ontology (as a theory of the world
represented) against which a number of wrapped data sources are integrated.
Such systems fall into the category of global information systems. For instance, the
influential Carnot system [SCH+ 97] of MCC mapped databases against the large
and well-known Cyc ontology [LGP+ 90, Cyc] using a deductive database language
called LDL [Zan96]. For other similar interesting work see e.g. the OBSERVER
project [MKSI96, MIKS00], SIMS [AK92, HK93, AAA+ 97] and InfoSleuth [NU97,
NBN99, BBB+ 97, FNPB99, NPU98].
It has been claimed (e.g. in [MIKS00]) that global information systems based
on ontologies are a substantial step forward compared to systems that integrate
against database schemata because ontologies allow to describe information con-
tent in data repositories independently of the underlying syntactic representation
of the data. The rationale behind this is that ontologies are defined as artifacts
on the knowledge level [New82, New93] rather than the symbol level and should
be independent of syntactic considerations. The above claim of a practical advan-
tage can be comfortably challenged, however. Apart from the necessary choice
of some vocabulary for naming the concepts, ontological commitments have to
be made on how to interrelate concepts (onto)logically (e.g. by part-of, is-a, and
instance-of relationships) as much as they are needed in database schema de-
sign. Research in Formal Ontology such as [Bra83, GW00b, GW00a] aims at
determining guidelines for ontological commitments. It is highly questionable whether
such work could ever keep humans from intuitively disagreeing on such issues.
However, until such consensus is reached, it would be misleading to make the
above claim in the pragmatic context of data integration. Note also that the
OBSERVER system of [MIKS00] uses the CLASSIC description logics system
for representing ontologies, a system that is even considered by its designers to
provide a symbol-level data model [BBMR89a, Bor95] (see also Section 3.7).

3.4.2 Capability Descriptions and Planning
Planning as a particularly important application of problem solving has been
among the core topics of interest in Artificial Intelligence ever since the influential
STRIPS planning system [FN71] established it as a research area in its own right,

with its own special theoretical results and algorithms5 [Wel99].
Planning problems in STRIPS-like planners are described by an initial state of
the world, a goal state, and a number of planning operators (“actions”), described
by pre- and postconditions and invariants6 . A solution to a planning problem is
then a (possibly only partially ordered) sequence of operator applications that
transforms the world from the given initial state to the desired goal state.
The need for capability description, which is strongly related to such operator
descriptions, in systems that use planning has resulted in a number of interesting
capability description languages [WT98], e.g. LARKS, the capability description
formalism of Retsina [SLK98], description-logics based formalisms [BD99], and
capability description languages for problem-solving methods in knowledge-based
systems (e.g. EXPECT [SGV99]).
Planning for information gathering has received much recent interest because
of its role in intelligent information systems for dealing with the information
overload of the World Wide Web [Etz96, Mae94]. Since planning for information
gathering is a quite special case of planning in general (for instance, information
gathering operations do not change the world in the sense actions in a physical
world do), special techniques have been developed for this problem [KW96, AK92,
GEW96].
The data integration problem can be formulated as a planning problem as well,
with reasoning being done based on the capability descriptions of data sources.
Interestingly, this leads to mappings between data sources and global ontologies
that are the inverse of the classical method of, for instance, Carnot, or of the method
conventionally used in federated and multidatabase systems, data warehouses, and
mediator systems. In the classical method, “destination” concepts that are part
of the “global” integration schemata are described as views over the data sources
(conceptually speaking; in practice, these mappings are often encoded as some
procedural transformation code that does the job). This conventional method of
data integration is thus termed global-as-view (GAV) integration.
Data integration by planning on the other hand proceeds by having contents of
data sources described as capabilities in terms of the global world model. Queries
are answered by building a plan that uses the given data sources as described
in the capability descriptions to extract and combine their data, and executing
it. This kind of data integration, where mappings are expressed as descriptions
of “local” sources in terms of the global ontology, is thus called local-as-view
(LAV) integration7 . Notable AI research that follows this route includes the
OCCAM planning algorithm [KW96] and the SIMS system [AK92, HK93] for
dealing with heterogeneous information sources, which is based on the LOOM
Footnote 5: Consider, for instance, partial-order planners [PW92, RN95] and, more recently, SAT
planning [KS92] and Graphplan [BF97].
Footnote 6: In STRIPS, operators were described by preconditions and so-called add- and delete-lists
for logical statements about the world that are changed by executing an action.
Footnote 7: We will address the GAV and LAV issues in more depth in dedicated sections, 3.5 and 3.6
respectively.

knowledge representation and reasoning system [MB87] (using its description
logic for expressing contents of data sources) and the Prodigy planner [CKM91].

3.4.3 Multi-agent Systems
Multi-agent systems (MAS) are, by their very conceptualization, cooperative in-
formation systems par excellence. We avoid touching the unsettled issue of trying
to define software agents here and refer to [Nwa96, WJ95, Wei99] or the exten-
sive community discussion of that issue in the UMBC agents mailing list archives
[Age]. MAS for information integration follow a heavy agent metaphor, in which
agents

• have an explicit logical model of their environment and other agents.

• need to reason over their knowledge and over the states of other agents.

• need to plan, both for information gathering (i.e., as a part of the data
integration problem analogous to query rewriting) and possibly for multi-
agent coordination (e.g. Partial Global Planning [DL91], GPGP [DL92,
DL95]).

• communicate in expressive agent communication languages. These usually
provide elementary building blocks8 for protocols and knowledge exchange
formats (e.g. KIF [GF92]).

Furthermore, agents in the information integration setting are usually de-
signed to be cooperative rather than self-interested [SL95].
Apart from being a welcome testbed and melting pot for various areas of AI
research, the field has its own interesting and still largely unresolved challenges.
The coordination problem in MAS revolves around much more than just provid-
ing languages for communication and knowledge exchange. The collaboration of
agents requires coordination whose provision is not yet sufficiently understood.
Much research has centered around providing coordination algorithms and pro-
tocols (e.g. the Contract Net Protocol [Smi80] and (Generalized) Partial Global
Planning [DL91, DL92, DL95]), research frameworks (e.g. TÆMS [Dec95]), ab-
stractions of protocols (e.g. conversation policies [SCB+ 98, GHB99]), social rules
[COZ00], pragmatics [HGB99], and game-theoretic considerations [PWC95]. For
further interesting work on coordination see [WBLX00, COZ00, Cro94, KJ99].
Another important problem is to establish multi-agent systems as a soft-
ware engineering paradigm – agent-oriented software engineering [Sho93, Jen99,
JW00].
Footnote 8: These building blocks are sometimes called performatives and at times have been motivated
by speech act theory [Sea69], as in the case of KQML [FFMM94, FL97].

Figure 3.4: MAS architectures for the intelligent integration of information. Ar-
rows between agents depict exemplary communication flows. Numbers denote
logical time stamps of communication flows.

Intelligent Information Integration has been a popular application of MAS.
Due to their approach of seeking interoperability of several highly autonomous
units (the agents), multi-agent systems are almost by definition performing an
integration task. Several systems thus have addressed information integration,
e.g. Retsina [SLK98], InfoSleuth [NU97, NBN99, BBB+ 97, FNPB99, NPU98],
KRAFT [PHG+ 99] and BOND [TBM99]. Such systems are particularly interest-
ing for their contributions to structural integration9 and have been less ground-
breaking with respect to semantic integration where usually techniques in the
tradition of those discussed elsewhere in this chapter are used10 .
A generic MAS architecture for information integration is depicted in Fig-
ure 3.4. Such cooperative multi-agent systems are networks of collaborating
agents of a number of categories, some of which we list next.

• Wrapper agents connect data sources (possibly legacy systems) to the sys-
Footnote 9: In principle they constitute the promise of the most open, hot-pluggable middleware
infrastructure possible.
Footnote 10: Surprisingly, systems such as InfoSleuth and KRAFT follow the global-as-view paradigm
for integration, as planning is not employed on the level of data integration as it is in
SIMS and OCCAM.

tem by advertising contents to other agents and listening for and answering
data gathering requests of other agents on behalf of the wrapped sources.
• Middle agents [DSW97, GK94] or facilitators aim at solving the connection
problem [DS83], i.e., the problem of enabling providers and requesters in a
multi-agent system to initially meet. Middle agents support interoperability
and cooperation by matching agents with others that may be helpful in
solving their integration problems. Such agents may have varying degrees
of “intelligence” and proactivity, and one notably distinguishes between
matchmakers, brokers and mediators.
Matchmakers are advanced yellow pages services with varying degrees of
sophistication that allow agents to advertise their services as well as to
inquire for services of other agents (e.g. [SLK98]). Broker agents [RZA95]
can be explained as analogous to real-life stock market or real estate brokers.
Brokers solve the connection problem by matching agents, but may (and
usually do) also act as intermediaries in the subsequent problem solving
process. This may, for instance, allow agents communicating via a broker
to remain anonymous. Mediators (e.g. [ABD+ 96]) add additional value
by acting as intermediaries between agents collaborating to achieve some
common goal and employing their own capabilities to support the problem
solving process. More precisely, mediators do not only attack the connection
problem on the level of finding matches, but often also resolve semantic
heterogeneity between agents in a heterogeneous system.
Note, however, that there is substantial terminological heterogeneity regarding
this issue. In particular, facilitators called brokers have had different
roles from the one described above in some systems for information
integration [NBN99, PHG+ 99].
• Data analysis and processing agents provide some value-adding reasoning
functionality to the other agents in the system.
• User agents represent the interests of users and gather information from the
system on their behalf.
In Figure 3.4, arrows between agents depict two exemplary communication
flows, one involving a matchmaker and one involving a mediator agent. The
arrows describe the directions of messages sent, and are attributed with logical
time stamps. The main difference between the two types of middle agents that
this figure is meant to clarify is that matchmakers may be consulted for services
but requester agents are then left to themselves for the problem solving task, while
mediators are usually highly involved throughout this process. The matchmaker
of Figure 3.4 checks back whether the agents that it plans to propose to the
requester are able to provide the requested service. This goes beyond a simple
yellow pages service.

For influential work on multi-agent systems architectures following the heavy
agent metaphor in general, we refer to ARCHON [CJ96, JCL+ 96], the Retsina
infrastructure [SPVG01], and KAoS [BDBW97]. In conclusion, it is necessary
to remark that MAS as cooperative information systems have gone much further
than just to information integration, for instance to managing and integrating
the business processes and supply chains of enterprises [PGR98, JNF98, JFJ+ 96,
JFN+ 00].

3.5 Global-as-view Integration
The global-as-view way by which mappings between schemata are defined – by
describing “global” integrated schemata in terms of the sources11 – has been
used in most of the architectures discussed so far. This includes multidatabase
systems, the data warehouse architecture where we had one component called
the “mediator” which performed data integration, and various AI approaches as
discussed in Section 3.4.
In this section, we will first discuss the mediator architecture of [Wie92], which
has been seminal to information systems research12 . Then, we approach global-
as-view integration in a simplistic way, through classical database views (one
may expect, however, that this is the approach taken most often in industrial
practice). Finally, we briefly discuss some research systems related to this area.

3.5.1 Mediation
Mediators are components of an information system that address a particular
heterogeneity problem in the system and provide a pragmatic “solution” to it. A
mediator is a “black box” that assumes a number of sources with some exported
schema each (these can be, for instance, wrapped databases or other mediators).
Mediators export some interface (some schema) against which data are integrated.
The integration problem is then left to a domain expert to address a certain
aspect of heterogeneity and to implement the mediator. Each mediator thus
encapsulates a particular integration problem. An overview of types of integration
problems (“mediation functions”) is given in [Wie92]. Such mediation functions
include the transformation and subsetting of databases, the merging of multiple
Footnote 11: This is also the method followed by any procedural code that transforms data adhering to
one schema into data adhering to another.
Footnote 12: Note that the term mediation has experienced substantial overload, and we have used it so
far in three different contexts and with four slightly different meanings. Unlike the
mediator concept in our data warehouse architecture [JLVV00], this fourth mediator concept
has a smaller granularity. Unlike mediator agents in AI, Wiederhold’s mediators
are far removed from aspects of multi-agent cooperation and are not meant to be “intelligent”
[Wie92]. Mediation as the “lazy” approach to data integration mentioned earlier in the context
of data warehousing closely coincides with this fourth concept.

Figure 3.5: A mediator architecture

heterogeneous databases, the abstraction and generalization of data, and methods
for dealing with uncertain data as well as incomplete or mismatched sources.
A typical architecture of a mediator system is shown in Figure 3.5. For struc-
tural integration, data sources are usually wrapped to permit a single way of
accessing sources in terms of data models and query languages. Mediators are
pieces of code encapsulating some operational knowledge of a domain expert,
implementing mediation functions that add value to and remove heterogeneity
from the data provided by the sources.

3.5.2 Integration by Database Views
Let us assume a relational database context. In the global-as-view approach,
global relations are expressed as views (e.g. SQL views) in terms of source rela-
tions. Given a global relation p(X̄) and sources p1, . . . , pn, p might be expressed
as a (finite) set of conjunctive views (Footnote 13) p(X̄) ← p1(X̄1), . . . , pn(X̄n).
Given a query posed in terms of view predicates, the query answering process
is straightforward, as it reduces to conjunctive query unfolding (see Section 2.1).

Example 3.5.1 Suppose we have four sources of information about books.

acm_proceedings(Title, ISBN)
Footnote 13: For the record, such a view is logically equivalent to a declarative constraint of the form

{⟨X̄⟩ | p(X̄)} ⊇ {⟨X̄⟩ | ∃Ȳ : p1(X̄1) ∧ . . . ∧ pn(X̄n)}

in set-theoretic notation, where X̄, Ȳ are tuples of variables, X̄1, . . . , X̄n are tuples of variables
and constants, and Ȳ = (X̄1 ∪ . . . ∪ X̄n) − X̄.

book1(ISBN, Title, Author, Publisher)
product(Name, Category, Producer, Price)
book_price(ISBN, Price)

We can create a positive database view providing an integrated interface to book
information as follows (Let us assume we are only interested in titles, publishers
and prices of books).

book(Title, Publisher, Price) ←
    book1(ISBN, Title, Author, Publisher),
    book_price(ISBN, Price).
book(Title, Publisher, Price) ←
    product(Title, “Book”, Publisher, Price).
book(Title, “ACM Press”, Price) ←
    acm_proceedings(Title, ISBN),
    book_price(ISBN, Price).

Queries asked over the relation “book” can be answered by unfolding them
with the views. 
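
To illustrate, the three views can be executed directly over source extensions; the following Python sketch performs a naive nested-loop evaluation of the view definitions of Example 3.5.1, with all source tuples invented purely for illustration.

# Made-up source extensions (illustrative data only).
acm_proceedings = {("Some ACM Proceedings", "1-11111-111-1")}
book1 = {("3-22222-222-2", "Some Textbook", "Some Author", "Some Publisher")}
product = {("Some Textbook", "Book", "Some Publisher", "60"),
           ("Rubber Duck", "Toy", "Acme", "5")}
book_price = {("3-22222-222-2", "55"), ("1-11111-111-1", "45")}

book = set()
# book(Title, Publisher, Price) <- book1(ISBN, Title, Author, Publisher), book_price(ISBN, Price).
for isbn, title, _author, publisher in book1:
    for isbn2, price in book_price:
        if isbn == isbn2:
            book.add((title, publisher, price))
# book(Title, Publisher, Price) <- product(Title, "Book", Publisher, Price).
for name, category, producer, price in product:
    if category == "Book":
        book.add((name, producer, price))
# book(Title, "ACM Press", Price) <- acm_proceedings(Title, ISBN), book_price(ISBN, Price).
for title, isbn in acm_proceedings:
    for isbn2, price in book_price:
        if isbn == isbn2:
            book.add((title, "ACM Press", price))

assert len(book) == 3     # one tuple per view in this illustrative instance

In a mediator, of course, the same computation is performed lazily when a query over book arrives, rather than by materializing the global relation.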

3.5.3 Systems
Research systems in this area have usually aimed at providing toolkits and de-
scription languages for automating the generation of mediators as far as possible.
Three notable research systems in this area, TSIMMIS [GMPQ+ 97], HERMES
[ACPS96], and Garlic [CHS+ 95], have been no exception. Since global-as-view
integration in its simplest (and relational) form is quite straightforward, research
systems also have put emphasis on advanced aspects such as multimedia data in-
tegration. In the following, we will have a somewhat closer look at the approach
taken in TSIMMIS.

TSIMMIS
TSIMMIS [GMPQ+ 97] (“The Stanford-IBM Manager of Multiple Information
Sources”) is a well-known research prototype that provides generators for me-
diators and wrappers. The generation of mediators and wrappers is a widely
proposed technique for leveraging the practical usefulness of the mediator ap-
proach. In this system, integration is based on the Object Exchange Model
(OEM) [PGMW95] of the Stanford Database Group, a simple semistructured
data model. It has also been used in other projects of that group, such as LORE
[AQM+ 97]. The Mediator Specification Language (MSL) uses a syntax similar to that of datalog, extended to the semistructured paradigm [ABS00, TMD92, PGMW95].

Mediator definitions are declaratively specified and can then be compiled down
to mediators. Of course, such mediator definitions can only be changed (or new
mediators added) offline, that is, changes require the definitions to be recompiled.
The semistructured data model and query language used in TSIMMIS also
allows for data sources that only supply data for some of the attributes in a
mediator interface, which we cannot appropriately match with relational database
views in the spirit of Example 3.5.1. For instance, it is possible to define mappings
of two sources s1 , s2 against a mediated relation r by

∀x, y ∃z : s1 (x, y) → r(x, y, z)
∀x, z ∃y : s2 (x, z) → r(x, y, z)
which would not satisfy the range restriction requirement when expressed as a
pair of conjunctive logical views. Given knowledge that the first attribute of r
functionally determines the other two (as object identifiers in OEM of course do),
the two views could nevertheless be used to answer a query such as q(x) ← r(x, y)
by compiling the above mappings into a mediator for the view

r(x, y, z) ← s1 (x, y), s2(x, z).
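To make the role of the functional dependency concrete, here is a tiny relational rendering of such a compiled mediator (Python; the sample object identifiers are invented for illustration, and this is not MSL). The mediator simply joins the two partial sources on the shared key attribute, relying on the assumption that the key functionally determines the remaining attributes.

    def mediate_r(s1, s2):
        # Join the two partial sources on the shared key (first attribute),
        # assuming that attribute functionally determines y and z.
        index = {x: z for (x, z) in s2}
        return [(x, y, index[x]) for (x, y) in s1 if x in index]

    s1 = [("o1", "a"), ("o2", "b")]     # s1(x, y)
    s2 = [("o1", "c")]                  # s2(x, z)
    print(mediate_r(s1, s2))            # [('o1', 'a', 'c')]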

More Research Systems
The HERMES system [ACPS96] is another mediator toolkit that aims at the broad goal of providing a complete methodology for source integration. The design of the system has taken special care to permit the integration of multimedia sources.
The system supports parameterized procedure calls that may be defined for ac-
cessing restricted sources and are then used by HERMES mediators to answer
queries. The Garlic system [CHS+ 95] is a research prototype that, similarly to
HERMES, aims at integrating multimedia sources. Other systems that clearly fall into the global-as-view category and that we have briefly touched on earlier are CARNOT and multi-agent systems such as KRAFT [PHG+ 99].

3.6 Local-as-view Integration and the Problem
of Answering Queries Using Views
Local-as-view integration (LAV) is strongly related to the database-theoretic
problem of answering (rewriting) queries using views [YL87, LMSS95, DGL00,
AD98, BLR97, RSU95, PV99, SDJL96, PL00, CDLV00a], which will be discussed
in more detail in this section.
Within data integration, the local-as-view approach is applied in global infor-
mation systems architectures. Influential LAV data integration systems include
the Information Manifold [LRO96], InfoMaster [GKD97], and SIMS [AAA+ 97,
AK92, HK93]. Beyond data integration, the problem of answering queries us-
ing views has also been found relevant for query optimization [CKPS95] (where

previously materialized queries are used to answer similar queries14 ), the mainte-
nance of physical data independence [TSI94], and Web-site management systems
[FFKL98].

3.6.1 Answering Queries using Views
The local-as-view approach is based on the notion of a “global” mediated schema,
that is, a specially designed integration schema. The content of “local” sources
is described by logical views in terms of the predicates of the “global” schema
(thus the term local-as-view). Given “global” predicates p1 , . . . , pn and a source
v, a LAV view can be defined as

v(X̄) ← p1 (X̄1 ), . . . , pn (X̄n ).

Assuming a query over global predicates p1 , . . . , pm , this query can be auto-
matically rewritten by the system to contain only source predicates (such as v)
instead of the global predicates.
For the purpose of data integration, we consider only the case where one
searches for complete rewritings, which are rewritings in which all global pred-
icates have been replaced by views. We aim at producing rewritings that are
minimal. A conjunctive query Q is minimal if there is no conjunctive query Q′ such that Q ≡ Q′ and Q′ has fewer subgoals than Q (see Section 2.4). For the minimality of positive queries (as sets of conjunctive queries) we furthermore require that conjunctive member queries are pairwise nonredundant, i.e. for a positive query {Q1 , . . . , Qn }, we require Qi ⊈ Qj and Qj ⊈ Qi for each pair of distinct i, j ∈ {1, . . . , n}.
One can either attempt to find equivalent rewritings or maximally contained
rewritings. Given a conjunctive query Q and a set of conjunctive views V, an
equivalent rewriting Q′ – if it exists – is a conjunctive query Q′ that only uses
the views and which, when expanded with the views, is equivalent to Q. Given a
conjunctive query Q and a set of conjunctive views V, Q′ is a maximally contained
rewriting15 (w.r.t. the positive queries16 ) if and only if each member query is, when
expanded using the views, contained in Q and there is no conjunctive query Q′′
s.t. when expanded with the views, it is contained in Q but Q′′ is not contained
in any of the member queries of Q′ . In general, it is not always possible to
14
The problem of answering queries using views is thus indirectly important to global-as-view
data integration approaches such as data warehousing as well.
15
Note that our definition of maximally contained rewritings is different from Levy’s [PL00]
where a rewriting is only maximally contained if it has the properties we enumerate and there
is at least one database for which the result of the original query is strictly larger than the
result of the rewriting. Under our definition, however, equivalent rewritings are also maximally
contained.
16
Maximally contained rewritings need to be defined relative to a query language.

find an equivalent rewriting, and the maximally contained rewriting – as a set of
conjunctive queries – may be empty.
Equivalent rewritings require that views be complete, as is usually the case
for true materialized database views. In a data integration setting, it is usually
appropriate to consider sources to be possibly incomplete.

Example 3.6.1 [Ull97] Suppose we have a global schema with a virtual pred-
icate p (“parent of”), a query

q(x, y) ← p(x, u), p(u, v), p(v, y).

and two sources s1 (“grandparent of”) and s2 (“parent of someone who is also a
parent”). We can define the following logical views

s1 (x, z) ← p(x, y), p(y, z).

s2 (x, y) ← p(x, y), p(y, z).
Let us first assume that the two views are complete, i.e. that they logically
correspond to the constraints

{⟨x, z⟩ | s1 (x, z)} ≡ {⟨x, z⟩ | ∃y : p(x, y) ∧ p(y, z)}

{⟨x, y⟩ | s2 (x, y)} ≡ {⟨x, y⟩ | ∃z : p(x, y) ∧ p(y, z)}
There is an equivalent rewriting of q:

q ′ (x, z) ← s2 (x, y), s1 (y, z).

Now if we assume that our views are incomplete sources in a data integration
system, they correspond to the logical constraints

{⟨x, z⟩ | s1 (x, z)} ⊆ {⟨x, z⟩ | ∃y : p(x, y) ∧ p(y, z)}

{⟨x, y⟩ | s2 (x, y)} ⊆ {⟨x, y⟩ | ∃z : p(x, y) ∧ p(y, z)}
meaning that s1 is a source of grandparent relationships and s2 is a source of parent relationships where the children are themselves parents, but neither source necessarily provides all such relationships (although they provide only such relationships). The implication direction of the conjunctive views shown above is
thus somewhat misleading, while the constraints based on set-theoretic notation
employed above are exact.
It is possible to show that the following positive query is a maximally con-
tained rewriting (as a set of conjunctive queries) of q that only uses the (incom-
plete) views s1 and s2 :
q ′ (x, z) ← s1 (x, y), s2 (y, z).

q ′ (x, z) ← s2 (x, y), s1 (y, z).
q ′ (w, z) ← s2 (w, x), s2(x, y), s2 (y, z).
Note that this rewriting is also nonredundant and minimal in the sense that we
cannot remove any member queries or subgoals and retain a maximally contained
rewriting. □

It can be shown that if both q and the views in V are conjunctive queries
(CQs) without arithmetic comparison predicates, then it is sufficient to consider
only rewritings with at most as many subgoals (views) as the original query
[LMSS95] as candidates for both equivalent and maximally contained rewritings.
(See also [Ull97].) A naive algorithm for finding an equivalent rewriting is thus
to guess an arbitrary rewriting Q′ of Q with at most as many subgoals as in Q
which uses only the views in V and then to check if Q′ is equivalent to Q. For
maximally contained positive rewritings, one can incrementally build a maximal
set of rewritings by searching the whole space of such rewritings (which is finite).
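The check underlying this enumeration, namely whether a candidate rewriting, once expanded with the view definitions, is contained in the original query, is the classical containment-mapping test of [CM77]. A minimal sketch follows (Python; the representation of a query as a pair of head variables and body atoms is an assumption for illustration, and constants are not handled).

    def contained_in(q1, q2):
        # True iff q1 is contained in q2, i.e. there is a containment mapping
        # from q2 to q1: head variables map position-wise onto head variables
        # and every subgoal of q2 maps onto some subgoal of q1 [CM77].
        # A query is (head_vars, body) with body a list of (pred, args).
        (head1, body1), (head2, body2) = q1, q2
        if len(head1) != len(head2):
            return False

        def extend(theta, frm, to):
            theta = dict(theta)
            for a, b in zip(frm, to):
                if theta.setdefault(a, b) != b:
                    return None
            return theta

        def search(i, theta):
            if i == len(body2):
                return True
            pred, args = body2[i]
            for p1, a1 in body1:
                if p1 == pred and len(a1) == len(args):
                    t = extend(theta, args, a1)
                    if t is not None and search(i + 1, t):
                        return True
            return False

        theta0 = extend({}, head2, head1)
        return theta0 is not None and search(0, theta0)

    # Expansion of q'(x,z) <- s1(x,y), s2(y,z) from Example 3.6.1 ...
    expansion = (("x", "z"),
                 [("p", ("x", "a")), ("p", ("a", "y")),     # s1(x, y) expanded
                  ("p", ("y", "z")), ("p", ("z", "b"))])    # s2(y, z) expanded
    # ... is contained in q (head variable y renamed to z for alignment):
    q = (("x", "z"),
         [("p", ("x", "u")), ("p", ("u", "v")), ("p", ("v", "z"))])
    print(contained_in(expansion, q))    # True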
The problem of answering queries using logical views is NP-complete already
in the simple case of conjunctive queries without arithmetic comparison predicates
[CM77, LMSS95]. Thus this is a presumably hard reasoning problem. However,
it spares the human designer from having to carry out the rewriting task by
hand17 . For more expressive classes of query languages, the problem is harder or
undecidable [vdM92, SDJL96, CV92, Shm87].

3.6.2 Algorithms
Several improvements over the naive query rewriting algorithm have been pro-
posed, among them the Bucket algorithm of the Information Manifold [LRO96],
the Inverse Rules algorithm [DG97] of the InfoMaster System [GKD97], the
MiniCon algorithm [PL00], OCCAM [KW96], and the Unification-join algorithm
[Qia96]. We will discuss three of these algorithms in more detail.
The Bucket algorithm uses the following simple optimization over the naive
algorithm. For each of the subgoals of a given query, each of the views is independently checked for whether it is possibly relevant to replacing that subgoal.
Such candidate views are collected in “buckets”, one for each subgoal. Exhaus-
tive search is then carried out in the cartesian product of the buckets. Thus the
necessary search space required for combining the views in the buckets is pruned
compared to the naive algorithm.
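The bucket-construction step can be sketched as follows (Python; the relevance test used here, requiring that positions holding distinguished query variables be exported in the view head, is a simplification of the full test, which also unifies the query subgoal with the view subgoal). For the query and views of Example 3.6.1, every bucket ends up containing both views, which illustrates that this per-subgoal test prunes less aggressively than the variable-based analysis of MiniCon discussed below.

    def buckets(query, views):
        # Simplified bucket construction: a view goes into the bucket of a query
        # subgoal if it has a subgoal with the same predicate that keeps in its
        # head every position the query needs distinguished.
        # query/views are (name, head_vars, body) triples; atoms are (pred, args).
        _, qhead, qbody = query
        result = []
        for pred, qargs in qbody:
            bucket = []
            for name, vhead, vbody in views:
                for vpred, vargs in vbody:
                    if vpred != pred:
                        continue
                    if all(va in vhead
                           for qa, va in zip(qargs, vargs) if qa in qhead):
                        bucket.append(name)
                        break
            result.append(((pred, qargs), bucket))
        return result

    # Query and views of Example 3.6.1
    q  = ("q",  ("x", "y"), [("p", ("x", "u")), ("p", ("u", "v")), ("p", ("v", "y"))])
    s1 = ("s1", ("x", "z"), [("p", ("x", "y")), ("p", ("y", "z"))])
    s2 = ("s2", ("x", "y"), [("p", ("x", "y")), ("p", ("y", "z"))])
    for goal, bucket in buckets(q, (s1, s2)):
        print(goal, "->", bucket)
    # Candidate rewritings are then assembled by exhaustively combining one view
    # from each bucket (the cartesian product) and checking containment.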
The inverse rules algorithm first transforms the views into Horn clauses. The
queries can then be answered by executing the combination of the query and the
Horn clauses representing the views as a logic program, in a bottom-up fashion.
17
In global-as-view integration, on the other hand, mediators have to be specially designed
in order to be able to answer a certain repertoire of queries.

Example 3.6.2 In the inverse rules algorithm, the views of Example 3.6.1 (under
the incomplete views semantics) correspond to the Horn Clauses

p(x, f1 (x, z)) ← s1 (x, z). p(f1 (x, z), z) ← s1 (x, z).
p(x, y) ← s2 (x, y). p(y, f2(x, y)) ← s2 (x, y).

Given instances s1 = {ha, ci, hb, di} and s2 = {ha, bi, hb, ci}, we can derive

p(a, f1 (a, c)), p(f1 (a, c), c), p(b, f1 (b, d)), p(f1 (b, d), d),
p(a, b), p(b, f2 (a, b)), p(b, c), p(c, f2 (b, c))

and finally q(a, d) as the answer to the query of Example 3.6.1. □

Such a logic program can be transformed into an equivalent (function-free)
nonrecursive datalog program, which can be unfolded into a set of conjunctive
queries using a simple transformation [DG97].
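The inverse-rule construction and a single naive bottom-up application can be sketched as follows (Python; Skolem terms are represented as tuples, and the final join along the query is written out by hand for this example; this is only an illustration, not the algorithm of [DG97]).

    def derived_p_facts(views, extents):
        # Inverse rules, applied directly: for every tuple of a source view and
        # every body atom of the view's definition, emit a fact over the global
        # predicate in which existential view variables become Skolem terms
        # f_view_var(head tuple), represented here as Python tuples.
        facts = set()
        for name, head_vars, body in views:
            for tup in extents[name]:
                theta = dict(zip(head_vars, tup))
                sk = lambda v: ("f_%s_%s" % (name, v),) + tup
                for pred, args in body:
                    facts.add((pred, tuple(theta.get(a, sk(a)) for a in args)))
        return facts

    views = [("s1", ("x", "z"), [("p", ("x", "y")), ("p", ("y", "z"))]),
             ("s2", ("x", "y"), [("p", ("x", "y")), ("p", ("y", "z"))])]
    extents = {"s1": [("a", "c"), ("b", "d")], "s2": [("a", "b"), ("b", "c")]}

    p = {args for pred, args in derived_p_facts(views, extents) if pred == "p"}
    # q(x,y) <- p(x,u), p(u,v), p(v,y); Skolem terms may participate in the joins
    # but must not appear in answers.
    answers = {(x, y)
               for (x, u) in p for (u2, v) in p for (v2, y) in p
               if u == u2 and v == v2
               and not isinstance(x, tuple) and not isinstance(y, tuple)}
    print(answers)    # {('a', 'd')}, as derived in Example 3.6.2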
The MiniCon algorithm uses information about variables occurring in queries
for finding maximally contained rewritings. The MiniCon algorithm is based on
the notion of MiniCon descriptions (MCD)18 .

Definition 3.6.3 Given a conjunctive query Q and a set of views V, an MCD
m is a tuple ⟨Vkm , hm , Gm , φm ⟩ of

• A view Vkm ∈ V.
• A head homomorphism 19 hm on the view Vkm .
• A set Gm ⊆ Body(Q) of subgoals of Q.

• A function φm : Vars(Gm ) → Vars(Vkm ) that maps the variables in the subgoals Gm of Q into the variables of Vkm .

that satisfies the following properties.

• For each g ∈ Gm , φm (g) ∈ Body(hm (Vkm )), i.e. g is mapped onto a subgoal of the view. (Gm is not necessarily the largest such set of subgoals of Q.)
18
Informally speaking, an MCD represents a fragment of a containment mapping from the
query to its rewriting encompassing only the application of a single view and which is in a sense
atomic.
19
A head homomorphism h : Vars(V ) → Vars(V ) is a mapping of variables that is the identity h(v) = v on variables not in the head of the view and maps head variables to head variables; more exactly, a head variable v ∈ Vars(Head(V )) is either mapped to itself (h(v) = v) or to another head variable for which h is the identity, i.e.

h(v) = w, w ∈ Vars(Head(V )), h(w) = w

Figure 3.6: MiniCon descriptions of the query and views of Example 3.6.1: for each MCD m1 , . . . , m5 , the figure aligns the images under φ of the subgoals of q(x, y) ← p(x, u), p(u, v), p(v, y) with the subgoals of the respective view.

• For each variable v ∈ Vars(Head(Q)), φm (v) ∈ Head(hm (Vkm )).

• For each variable v ∈ Vars(Q) for which

φm (v) ∈ Vars(hm (Vkm )) − Vars(Head(hm (Vkm )))

(i.e., φm (v) is among the existentially quantified variables20 of the head
homomorphism on the view), all other subgoals in Q that contain v are in
Gm .

• m is minimal in the sense that there is no subset of Gm s.t. the previous
property remains true.

• hm is the least restrictive head homomorphism necessary in order to allow
the view and query subgoals to be unified.
□

This is best explained with an example.

Example 3.6.4 For the query and the views of Example 3.6.1, there are five
MCDs (see Figure 3.6). Note that for all MCDs and variables, their head homo-
morphism is the identity (id(v) = v for all variables in the respective view), so
we do not explicitly state it. For brevity, let g1 , g2, g3 denote the three subgoals
of Q.
m1 = ⟨s1 , id, G1 = {g1 , g2 }, φ1 ⟩ with φ1 (x) = x, φ1 (u) = y, φ1 (v) = z.
20
In the data integration setting, these are thus the attributes that were projected out in the materialized views. Data for them are not available, and the variables bound to these attributes not only cannot be bound to head variables of the query but also must not occur in any subgoals of Q left to be covered by other MCDs to produce a rewriting, as this would require a join of two source views on attributes that are “not available”.

m2 = ⟨s1 , id, G2 = {g2 , g3 }, φ2 ⟩ with φ2 (u) = x, φ2 (v) = y, φ2 (y) = z.
m3 = ⟨s2 , id, G3 = {g1 }, φ3 ⟩ with φ3 (x) = x, φ3 (u) = y.
m4 = ⟨s2 , id, G4 = {g2 }, φ4 ⟩ with φ4 (u) = x, φ4 (v) = y.
m5 = ⟨s2 , id, G5 = {g3 }, φ5 ⟩ with φ5 (v) = x, φ5 (y) = y. □

Given the set M of all MiniCon descriptions for a query Q and a set of views
V, all conjunctive queries that have to be considered for a maximally contained
positive rewriting of Q can be constructed from combinations m1 , . . . , mk of elements of M for which the sets Gm1 , . . . , Gmk form a partition of the set of all subgoals in Q, i.e. Gm1 ∪ . . . ∪ Gmk = Body(Q) and Gmi ∩ Gmj = ∅ for each pair i ≠ j in 1 . . . k. Note that one also does not have to compute any containment
mappings as needed in the Bucket algorithm anymore, as this is already implicit
in the combination of hi and φi .

Example 3.6.5 Let M be the set of five MCDs of the previous example. There
are three such partitions of {g1 , g2 , g3 } using G1 , . . . , G5 , namely {G1 , G5 }, {G3 , G2 }, and {G3 , G4 , G5 }. The rewritings producible from these partitions are those of Example 3.6.1. □

We arrive at the maximally contained rewriting of Q by transforming each
of the partitions in the following way. Let {m1 , . . . , mn } be such a partition. We apply φi −1 (hi (Vki )) for each MCD mi and combine the transformed views by conjunction into conjunctive queries. For those variables of a view for which φi −1
is undefined, i.e. variables that only appear in subgoals of the view that are not
matched with any of the subgoals of the query in the MCD, new variable names
need to be invented.
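The combination phase can be sketched as follows (Python): the MCDs are represented only by the sets of query subgoals they cover, and the sketch enumerates the combinations whose coverage sets are pairwise disjoint and jointly exhaustive; applying φi −1 and inventing fresh variables for unmatched view variables, as described above, is omitted here.

    def partitions(goals, mcds):
        # Enumerate all combinations of MCDs whose covered subgoal sets are
        # pairwise disjoint and together cover all query subgoals.
        def rec(remaining, chosen, start):
            if not remaining:
                yield list(chosen)
                return
            for i in range(start, len(mcds)):
                name, covered = mcds[i]
                if covered <= remaining:     # disjoint from everything chosen
                    yield from rec(remaining - covered, chosen + [name], i + 1)
        return rec(frozenset(goals), [], 0)

    # The five MCDs of Example 3.6.4, identified by the subgoals they cover.
    mcds = [("m1", frozenset({"g1", "g2"})), ("m2", frozenset({"g2", "g3"})),
            ("m3", frozenset({"g1"})), ("m4", frozenset({"g2"})),
            ("m5", frozenset({"g3"}))]
    for combo in partitions({"g1", "g2", "g3"}, mcds):
        print(combo)
    # ['m1', 'm5'], ['m2', 'm3'], ['m3', 'm4', 'm5']: the three partitions of
    # Example 3.6.5.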
Note that none of the three algorithms that we have discussed directly pro-
duces rewritings that are guaranteed to be minimal, so results have to be sepa-
rately optimized to obtain this property.
The Inverse Rules algorithm in its original formulation produces a datalog
rewriting, and rewritten views are kept separate from queries. In the case of
the rewriting of conjunctive queries, the rewriting process thus defers part of the
activity carried out by the other two algorithms to the time of query execution.
To compare this algorithm with the others, it is thus necessary to unfold the dat-
alog program produced by the Inverse Rules algorithm using the transformation
of [DL97b] (also discussed in Section 7.2) or include query execution into the
performance consideration.
Given moderately sophisticated techniques for executing datalog queries, the
Inverse Rules algorithm performs better than the brute-force bucket algorithm.
The MiniCon algorithm, which takes into account more problem-specific knowl-
edge and thus reduces the amount of redundant computations, however, in prac-
tice outperforms even the Inverse Rules algorithm in the altered form that unfolds
the rewritings into sets of conjunctive queries [PL00].

3.6.3 Bibliographic Notes
The theory of answering queries using views is surveyed in [Hal00] and [Lev00].
It is strongly related to the query containment problem, and is usually at least as
hard. The exception is the problem of answering datalog queries using conjunctive
views, which is efficiently solvable [DG97], while the related containment problem
is undecidable [Shm87]. On the other hand, the solution proposed in [DG97] does
not apply query rewriting in the strong sense21 .
The query rewriting problem in the presence of arithmetic comparison predi-
cates in the query and views has been addressed in [LMSS95] for the case of equiv-
alent rewritings. For the case of maximally contained rewritings, it is known that
no complete algorithm can exist, not even one that produces a recursive rewrit-
ing [AD98]. A sound algorithm that covers many practically important cases,
however, is presented in [PL00].
Queries with aggregation are addressed in [SDJL96]. The problem of an-
swering queries using views in object-oriented databases and OQL [CBB+ 97] has
been addressed in [FVR96]. The same problem in the case of regular path queries
in semistructured data models is discussed in [CDLV99, CDLV00b, CDLV00a].
The problem of answering queries using views with functional dependencies (over
the global predicates) has been addressed in [LMSS95] for the case of equivalent
rewritings, where the bound on the maximal number of subgoals only needs to be
slightly extended to the sum of the number of the subgoals in the original query
plus the sum of the arities of the subgoals. Maximally contained rewritings for
the same case may need to be recursive [DL97b, DGL00].

Binding Patterns (Adornments)
The problem of answering queries using views with binding patterns derives its rel-
evance from the fact that many sources in data integration systems have restricted
query interfaces. This is the case for legacy systems as well as for screen-scraping
Web interfaces where certain chunks of information may need to be provided s.t.
queries can be executed (e.g. book titles in online book stores). These restrictions
can be conveniently modeled using binding patterns22 .
A binding pattern is a mark telling, for each argument position of the predi-
cate, whether it is bound or free. At query execution time, variables in argument
positions marked “bound” have to be bound to constants before the extent of the
predicate is accessed (i.e., the source is queried).

Example 3.6.6 Consider the query q b,f (x, z) ← p(x, y), p(y, z), which requires
(and guarantees) that the variable x will be bound to a constant when executed.
21
We will apply this technique for rewriting recursive queries in Section 7.2.
22
Binding patterns or adornments have been used elsewhere, for instance in the theory of
optimizing recursive queries [Ull89].

Furthermore, we have a view v b,f (x, y) ← p(x, y). for a source that can only
answer queries when provided “input” in its first attribute position. The query
can be rewritten into q b,f (x, z) ← v(x, y), v(y, z). However, the query q f,b (x, y) ← p(x, y) cannot be rewritten because the only available source does not allow accessing p tuples without providing input for the first attribute position. □

Binding patterns allow the integration of data transformation functions into
the rewriting process, where input arguments of such functions are modeled as
“bound” and output arguments as “free”.
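Whether a candidate rewriting respects the binding patterns of its sources can be checked by trying to order its subgoals so that every “bound” position is already bound; a greedy strategy suffices, since executing an executable subgoal only adds bindings. A sketch follows (Python, with the names of Example 3.6.6; constants are not treated specially, an assumption for illustration).

    def executable_order(subgoals, patterns, inputs):
        # Order the subgoals so that every argument in a 'b' position is bound
        # by the query's inputs or by an earlier subgoal; return None if no
        # such ordering exists.
        bound, remaining, order = set(inputs), list(subgoals), []
        while remaining:
            for goal in remaining:
                pred, args = goal
                if all(a in bound
                       for a, m in zip(args, patterns[pred]) if m == "b"):
                    bound.update(args)   # executing the goal binds its variables
                    order.append(goal)
                    remaining.remove(goal)
                    break
            else:
                return None
        return order

    # Example 3.6.6: the source v has adornment "bf".
    print(executable_order([("v", ("y", "z")), ("v", ("x", "y"))],
                           {"v": "bf"}, inputs={"x"}))
    # [('v', ('x', 'y')), ('v', ('y', 'z'))] -- the rewriting of q^{b,f}.
    print(executable_order([("v", ("x", "y"))], {"v": "bf"}, inputs={"y"}))
    # None -- matching the observation that q^{f,b} cannot be rewritten.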
For the problem of computing equivalent rewritings given sources with binding
patterns, the search space is larger than in the case of the problem of answering
queries using views without binding patterns [RSU95], but the problem remains
NP-complete. Maximally contained rewritings may not be expressible as finite
sets of conjunctive queries, but can be encoded as recursive datalog programs
[KW96, DL97b, DGL00].
Algorithms and results bounding the search for equivalent rewritings have
been presented in [RSU95]. Earlier, queries with ”foreign functions” were consid-
ered in the context of query optimization in [CS93]. The Information Manifold
[LRO96], a system for integrating Web sources, supports source descriptions with
binding patterns that permit the specification of input and output attributes.
They are meant to facilitate the integration of sources that do not have full rela-
tional query capabilities, such as legacy sources or screen-scraping Web interfaces.

Answering Queries using Views under the Closed World Assumption
Note that we have so far discussed the problem of answering queries using views
in the light of an open-world assumption, which is appropriate in the context
of data integration and the assumption that sources may provide incomplete
information. It is also possible to approach the problem under a closed-world
semantics, centered around the notion of certain answers [AD98].

Example 3.6.7 Consider a query q(x, y) ← p(x, y). and sources v1 (x) ← p(x, y).
and v2 (y) ← p(x, y). Under the open-world assumption, this query cannot be
answered. Let the extents of v1 and v2 now be v1 = {⟨a⟩} and v2 = {⟨b⟩}. Under the closed-world assumption, we have the certain answer ⟨a, b⟩ to the query, because the projections of the tuples in the extent of p are complete and entail that certain answer. □
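For this small example, the certain answer can also be verified by brute force: enumerate every database for p over the active domain whose projections are exactly the given view extents and intersect the query answers. A sketch follows (Python); note that restricting candidate databases to the active domain is adequate for this example but is not a sound procedure in general.

    from itertools import chain, combinations

    def certain_answers(v1, v2, domain):
        # Closed-world views: the projections of p onto its first and second
        # column must be exactly v1 and v2. Intersect the answers to
        # q(x,y) <- p(x,y) over all candidate databases within the domain.
        tuples = [(x, y) for x in domain for y in domain]
        candidates = chain.from_iterable(
            combinations(tuples, n) for n in range(len(tuples) + 1))
        answers = None
        for p in candidates:
            if {x for (x, _) in p} == v1 and {y for (_, y) in p} == v2:
                ans = set(p)                        # q simply returns p
                answers = ans if answers is None else answers & ans
        return answers if answers is not None else set()

    print(certain_answers({"a"}, {"b"}, {"a", "b"}))    # {('a', 'b')}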

This problem and its complexity are discussed in [AD98, MLF00]. Note that
the problem of answering queries using views under the closed-world assumption
has the practical disadvantage that reasoning can only be done relative to the
data rather than the query, thus leading to a scalability problem.

3.7 Description Logics-based Information Inte-
gration
3.7.1 Description Logics
Description logics 23 (DL), also known as terminological logics or concept lan-
guages, are structured logical languages that are based on a well-designed tradeoff
between expressive power and complexity. The main goal is the design of languages that allow a large number of practical problems related to concepts and objects to be expressed conveniently while still remaining decidable24 . They can be
motivated by semantic networks, frame languages, terminological reasoning, and
semantic and object-oriented data models [RN95].
Description logics are usually constructed from unary relational predicates
(called concepts or concept classes) and binary relations (roles or attributes).
Instances of concepts are usually called individuals. Description logics are de-
fined by a fixed set of logical constructors, such as concept intersection C1 ⊓ C2 ,
union C1 ⊔ C2 and negation ¬C, all-quantification of roles with qualification ∀R.C
(denoting the concept {⟨x⟩ | ∀y : R(x, y) → C(y)}), existential quantification, which may (∃R.C, denoting {⟨x⟩ | ∃y : R(x, y) ∧ C(y)}) or may not (∃R) sup-
port qualification, the conjunction and union of roles, the concatenation of roles
R1 ◦ R2 , number restrictions on roles ((≤ nR) and (≥ nR), where n is a constant
integer), and others25 . More complex concepts and roles are defined inductively
from atomic concepts and roles using the provided constructors.
Constraints are of the form C1 ⊑ C2 or C1 ≡ C2 , where C1 and C2 are concepts.
Constraints are subsumption (logical “containment” of the extents of the expres-
sions) relationships between concepts. For instance, the subsumption relationship
C1 ⊑ C2 expresses the logical constraint ∀x : C1 (x) → C2 (x).
The semantics of these languages is the straightforward classical logical one, applied to the syntactic peculiarities of such languages. The syntax of
23
We restrict the presentation of description logics to a short overview. For a more detailed
introduction to this area see e.g. [DLNS96] or [Fra99].
24
However, the ancestor of description logics systems, KL-ONE [BS85], was found not to have this property [SS89]. The culprit was the same-as constructor, which makes it possible to express concepts of the form

∀y1 , y2 : (R1 (x, y1 ) ∧ R2 (x, y2 )) → y1 = y2

This constructor makes description logics lose the tree-model property [Var97] and their cor-
respondence with modal logics, and causes even the simplest and most restricted description logics to become undecidable (see e.g. [DLNS96]). This problem was fixed in the successor sys-
tem CLASSIC [BPS94, BBMR89b] by a slight change of the semantics of extents (a “hack”).
Note also that the LOOM system [MB87], which is often listed among description logics sys-
tems, provides an incomplete reasoning service over a very expressive logical language.
25
These constructors are motivated by the ALC family of languages [SSS91, DLNS96]. See
[PSS93] for a standardization effort.

most description logics languages differs from the classical syntax of first-order
logics because constraints in such concept languages usually can be expressed in
a variable-free form.
The main reasoning problems in description logics systems are subsumption
and classification. Subsumption is the logical implication problem in description
logics languages on the level of concepts. Given a set of constraints Σ in a DL
language, subsumption is the problem of deciding whether Σ implies the truth
of the logical formula corresponding to an additional constraint C1 ⊑ C2 . In
other words, this is the problem of deciding whether Σ implies that concept C1
is contained in C2 . The classification problem is to decide whether a certain
individual belongs to a given concept class.

3.7.2 Description Logics as a Database Paradigm
Description logics systems have been discussed as database systems before, e.g.
in the context of CLASSIC [BBMR89a, Bor95] and DLR [CDL98a, CDL+ 98b,
CDL99]. Description logics are relevant to data integration in two ways. Given
that queries are expressed as concepts and constraints express inter-schema rela-
tionships such as views,

• concept subsumption can be used to decide query containment under con-
straints and

• the classification of individuals (the objects of a database or a set of het-
erogeneous databases) can be used for answering queries in heterogeneous
databases.

Apart from that, description logics have been used to verify the consistency
of schemata [FN00]. Let us consider description logics subsumption and class-
ification as a way of performing data integration.

Example 3.7.1 Consider the following set of three constraints.
GrandparentOrNoParent ≡ Person ⊓ ∀child.(∃child.⊤)
ParentOfFerrariDriver ≡ Person ⊓ ∃child.(∃drives.Ferrari)
Ferrari ⊑ ItalianCar
Given our data integration setting, let GrandparentOrNoParent and Par-
entOfFerrariDriver be database relations with an extent (data sources). “child”
and “drives” are roles. The first constraint describes individuals of the class GrandparentOrNoParent as persons all of whose children, if any, are themselves parents. The name of the second source speaks for itself. In the third
constraint, we define Ferraris as Italian cars (that is, the concept class Ferrari is
a subclass of the class of Italian cars). Now let us ask a query for all persons who
have children that drive Ferraris and have children themselves.

Person ⊓ ∃child.((∃child.⊤) ⊓ (∃drives.Ferrari))

This query can be answered by attempting to classify all the individuals known
to the system. The answer will be the set of individuals that belong to both the
classes GrandparentOrNoParent and ParentOfFerrariDriver. It is also derivable
that our constraints imply
Person ⊓ ∃child.((∃child.⊤) ⊓ (∃drives.Ferrari)) ≡
GrandparentOrNoParent ⊓ ParentOfFerrariDriver
Also, we can determine that our set of constraints implies the subsumption
Person ⊓ ∃child.((∃child.⊤) ⊓ (∃drives.ItalianCar)) ⊒
GrandparentOrNoParent ⊓ ParentOfFerrariDriver.
but not equivalence. □
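To make the constructor semantics concrete, the concepts of this example can be evaluated over a single explicit, finite interpretation in which all individuals and role fillers are given. The sketch below (Python; the individuals ann, bob, cid and the car f40 are invented for illustration) does exactly this; it is not a description logics classification procedure, which would have to reason over all models of the constraints.

    def fillers(role, x, interp):
        return {y for (a, y) in interp[role] if a == x}

    def grandparent_or_no_parent(x, i):
        # Person ⊓ ∀child.(∃child.⊤)
        return x in i["Person"] and all(fillers("child", c, i)
                                        for c in fillers("child", x, i))

    def parent_of_ferrari_driver(x, i):
        # Person ⊓ ∃child.(∃drives.Ferrari)
        return x in i["Person"] and any(fillers("drives", c, i) & i["Ferrari"]
                                        for c in fillers("child", x, i))

    def query(x, i):
        # Person ⊓ ∃child.((∃child.⊤) ⊓ (∃drives.Ferrari))
        return x in i["Person"] and any(
            fillers("child", c, i) and fillers("drives", c, i) & i["Ferrari"]
            for c in fillers("child", x, i))

    interp = {"Person": {"ann", "bob", "cid"}, "Ferrari": {"f40"},
              "child": {("ann", "bob"), ("bob", "cid")},
              "drives": {("bob", "f40")}}
    for x in sorted(interp["Person"]):
        print(x, query(x, interp),
              grandparent_or_no_parent(x, interp)
              and parent_of_ferrari_driver(x, interp))
    # In this interpretation only ann satisfies the query concept, and ann also
    # belongs to both source concepts.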

Note that the constraints of the previous example clearly follow a local-as-
view pattern26 . In general, however, constraints in description logics are truly symmetric (in constraints of the form C1 ⊑ C2 or C1 ≡ C2 , both C1 and C2 may be complex composed concept definitions representing queries), making it possible to combine global-as-view and local-as-view integration.
Recently, two kinds of extensions to the ALC-style languages (for which decid-
ability is of course preserved) have been proposed. Firstly, there has been work
on defining concepts using fixpoints, e.g. for transitive roles (µALCQ [DL97a], [HM00]), which allow general regular path expressions to be expressed, as these are important in the context of queries over semistructured databases (for instance,
see the expressive description logic DLR [CDL98a, CDL+ 98b, CDL99]). Sec-
ondly, description logics (e.g., again, DLR) have dropped the requirement that
roles be binary relations. Instead, arbitrary relations may be used but have to
be projected down to binary before being used in constraints.
The restrictions and drawbacks of description logics for data integration are
threefold.

• Description logics provide two kinds of reasoning, query answering [CDL99]
by the classification of data and the verification of query containment by
subsumption. They do not lend themselves to query rewriting, however.
While it is possible to check, given a rewriting, if it is contained in the
input query, there is in general no way of finding such a rewriting given only
the input query. Query answering, however, is impractical, as it requires
all the data available in the system to be imported into the description
26
Note that description logics-based data integration is sometimes considered a case of local-
as-view integration. We kept the discussion separate to leave the work on the problem of
answering queries using views to its own section.

logics system, where each data object has to be independently classified for
membership in the concept class described by the query. This does not scale
to large databases and may not be feasible because data sources may have
restricted (e.g. screen-scraping) interfaces or be legacy systems, rendering
it impossible to extract “all” their data.
• Query languages are restricted to tree-style queries without any circulari-
ties. (Consider our earlier comment on same-as constraints and the entailed
undecidability.) For instance, this excludes simple queries such as

q(x) ← parent(x, y), employer(x, y).

3.7.3 Hybrid Reasoning Systems
For efficiency reasons, recent description logics systems (e.g. KRIS [BH91],
BACK [vLNPS87, NvL88], KRYPTON [BPGL85] and FaCT [Hor98]) have sepa-
rated the reasoning with concepts (TBox reasoning) from the reasoning with indi-
viduals (ABox reasoning [HM00]), using different techniques for the two problems
and creating hybrid reasoning systems [Neb89].
Hybrid knowledge representation systems have also been built by combining
description logics reasoning with deductive databases and nonmonotonic rea-
soning [DLNS98, Ros99] or local-as-view integration using database techniques
[LR96]. The Information Manifold [LRO96, BLR97], a local-as-view system with
query rewriting based on the Bucket algorithm uses the description logics CARIN
[LR96] to constrain concepts used in source descriptions (views)27 .

3.8 The Model Management Approach
The vision of the model management approach is to represent schemata and inter-
schema mappings as first-class objects in a repository28 [BLP00]. This approach makes it possible to define powerful operations on schemata and mappings, such as the unfolding (concatenation) of mappings and the application of mappings to schemata
in order to transform them.
Model management permits the computer-aided manipulation of such meta-
data using easy-to-use graphical user interfaces, as demonstrated by both research
systems (e.g. Clio [MHH+ 01], ONION [MKW00]) and commercial systems such
as Microsoft Repository [BB99]. The OBSERVER system [MIKS00] manages
several heterogeneous ontologies and mappings between them in a repository and may be considered another system pursuing the model management approach.
27
This is an alternative role of description logics systems in data integration.
28
This relates to interesting research on logical languages for reasoning about schemata (e.g.
F-Logic [KL89], HiLog [CKW89], and Telos/ConceptBase [JGJ+ 95]) and meta-data query lan-
guages [LSS99, RVW99].

Schema matching techniques [MHH+ 01, MZ98, MKW00] have been used in
such systems for defining mappings between schemata. Most work in this area is
based on the definition of correspondences between schema objects (e.g. classes,
attributes, or relationships), often graphically, by drawing lines between them
[MKW00, MZ98, BLP00, MHH+ 01]. The formalisms for defining mappings have
often been quite restrictive, and agreed-upon semantics have not yet emerged.
Systems such as Clio [MHH+ 01] propose several alternative semantics for such
correspondences for users to choose among.
Schema matching has also been used for XML data transformation [MZ98].
For data integration, these approaches have the drawback that the integration
problem is solved by processing the data rather than transforming the queries,
thus leading to a scalability problem.

3.9 Discussion of Approaches
Quality Factors of Data Integration Architectures
In this chapter, we have encountered a number of data integration architectures.
Given the integration problem motivated in Chapter 1, some of the main questions
regarding the quality of data integration architectures are

• Does the approach apply query rewriting or query answering? This is im-
portant because if the output of the data integration process is a query
which can be independently optimized and reformulated, performance im-
provements are possible that otherwise would not be attainable. The sepa-
ration is also important because in some approaches, the complexity of inte-
gration by data transformation is much harder than just executing queries
arriving at the same results, if such queries exist and can be computed. Fi-
nally, such a separation allows to select the best implementations for both
problems – core data integration and query evaluation – independently.

• Does the approach use a global schema against which all sources are in-
tegrated, or may there be several different schemata against which data
integration is carried out? The first approach may be preferable from a
standpoint of managing mappings. If there is only a single integration
schema, fewer mappings may be needed than if there are many. Note that
given m integration schemata and n sources, on the order of m · n mappings may be needed to integrate them. (That makes m² in a federated database
system.) Clearly, a global integration schema (m = 1) is usually preferable
over a quadratically growing number of mappings.
However, the integration problem may require support for multiple au-
tonomous integration schemata, which may evolve independently. Change
of requirements may lead to the evolution of schemata against which data

                            Global-as-view/procedures       Local-as-view
  Management of change      Problematic: change of a        Good
  to sources                single source may require
                            the redesign of (many)
                            mediators/procedures
  Management of change      Problematic: coupling of        Problematic: change of the
  of requirements           mediator interfaces             global schema requires global
                                                            redesign of views

Figure 3.7: Comparison of global-as-view and local-as-view integration.

are integrated. If there are many mostly independent integration problems,
it may be preferable to avoid the creation of a single global schema. If there
are several smaller schemata and only one of them needs to be changed, one
can expect that fewer mappings will be affected.

• How stable and reusable are mappings when change occurs? Given a large
information infrastructure that needs to be managed, one does not want
changes to propagate through the system further than absolutely neces-
sary, invalidating other components that then need to be changed as well.
Subsystems should be largely decoupled, making changes manageable. Al-
ternatively, if changes do need to occur, it should be possible to automate
them as far as possible.
There are two kinds of changes that we want to differentiate between, the
change of sources and the change of integration requirements (or the evo-
lution of an integration schema or “global” schema).

• How well does the approach support the mapping of sources and integration
schemata that show serious concept mismatch? As we will show later in
this section, procedural approaches as well as simple view-based approaches
have restrictions with respect to this issue that are more severe than may appear at first sight. Declarative approaches with symmetric constraints are the most desirable and complete.

Global-as-view versus Local-as-view Integration
Let us first compare local-as-view and global-as-view integration. A major ad-
vantage of local-as-view mappings is their maintainability (Figure 3.7). When
sources are added or removed, change remains local to those logical views that
define these sources. GAV mediators may require a major global redesign when
sources change, which may propagate through many mediators. Once a global
integration schema has been defined for LAV, this schema allows good decou-
pling between sources and the global information system, which is essential if

                          Query        Global     “Declarative”   Symmetric
                          rewriting?   schema?    approach?       constraints
  Federated Databases     no (?)       no         no              no
  Data Warehousing        yes/no       yes        no              no
  Mediator systems        no           (yes)      no              no
  Global Inf. Systems     yes          yes        yes/no          no
  Description Logics      no           no         yes             yes
  Model management        yes/no       no         no (?)          no

Figure 3.8: Comparison of Data Integration Architectures.

ease of change is an issue. However, designing an appropriate global schema
for local-as-view integration is hard, and requires a good understanding of the
domain. Furthermore, the application of the local-as-view approach is only rea-
sonable if the overall goals and requirements of the global information system
do not change; otherwise, the global schema as well as all defined logical views
may quickly become invalid and require complete redesign. The interfaces that
GAV mediators export, on the other hand, often follow quite straightforwardly
and naturally from the sources that have to be combined.
LAV has sometimes been called a declarative method, and GAV procedural.
Indeed, the “schemata” that global-as-view mediators export are usually more restrictive as to what kinds of queries can be asked than in LAV, where less
knowledge about how queries are answered is put into the views at design time
and more is decided at runtime. Indeed, LAV takes a more global perspective
when answering queries than GAV (the overall integration schema becomes a
mediated schema [Lev00]).
As pointed out earlier in Chapter 1, both the local-as-view and the global-as-
view approach make a very important assumption. It is supposed that the “global
schemata” resp. interfaces exported by mediators29 can be designed at will for the
special purpose of integrating a number of sources. Either approach fails if this
assumption does not hold. For instance, consider the case of Example 3.6.1. We
cannot build a GAV mediator that answers any queries using the given sources
if we are required to export a “parents”(-only) interface. Conversely, imagine
source relations containing attributes that have no analog in the global logical
schema in the case of LAV.

Comparison of Architectures
Now consider Figure 3.8, in which we compare the data integration architectures
discussed in this chapter.
29
These are in a sense “global” as well, because if they are not general enough, they will have
to be redesigned when further sources are added to a mediator.

• Federated databases support the autonomy of component databases. There
is thus no central “global” schema in the architecture. Traditionally, data
have been translated procedurally between schemata, although this is in
principle not a necessity.

• In the data warehousing architecture, sources are integrated against a single
sophisticated global warehouse schema. Integration is usually global-as-
view and procedural.

• Mediators à la [Wie92] apply query answering in a procedural manner. Al-
though mediators in systems such as TSIMMIS [GMPQ+ 97] are specified
declaratively, these specifications are compiled down into software compo-
nents that answer queries on the level of data. Global-as-view integration
by database views is based on query rewriting. However, mediators do not
take a global perspective with respect to the schema as known from local-
as-view and description logics integration. Although database views can be
considered as constraints under a declarative semantics, no global reasoning
under this semantics will lead to more complete results than just using the
views independently.
Mediators independently export interfaces according to which they can pro-
vide integrated data. Mediators making use of the services of other me-
diators are strongly coupled via their interfaces (see Figure 3.7). While
the mediator architecture at first sight does not rely on a global schema,
this coupling entails the usual disadvantages of global schemata, namely
that changes of requirements may lead to the need of a global, very work-
intensive redesign of many components (mediators) of the system.

• Global-information systems may either use GAV or LAV integration. The
first case is not substantially different from the mediator approach just
discussed. Local-as-view integration has been discussed in sufficient detail
earlier.

• Description logics systems use a declarative approach with symmetric constraints, which can encode both mappings that would usually be considered local-as-view and ones usually considered global-as-view. The designer may effectively define a global schema against which all sources are integrated, but is free to do otherwise. Unfortunately, the approach not only rules out query rewriting; worse, answering queries usually has high data complexity, compromising scalability.

• The model management approach at the core leaves open which integration
technology is to be used. While state-of-the-art research often uses very
restrictive mappings with a somewhat declarative flavor, one is free to make

other choices. Since integration schemata are just objects among many, no
global schema strategy can be observed.

We are now in the position to apply the lessons learned from previous work
to our problem of Section 1.2.
Chapter 4

A Short Sightseeing Tour
through our Reference
Architecture

4.1 Architecture
The data integration architecture of Figure 4.1 will be made our reference for the
presentation of the contributions of this thesis. It contains a number of informa-
tion systems that retain design autonomy for their schemata, data models, and
query languages. Each information system may contain a number of databases
and processes which access and manipulate local data. For simplicity, but with-
out loss of generality, we assume the information systems to logically each contain
a single database over a single consistent schema. Other cases are handled by
either using distributed database techniques locally or splitting one information
system up into several systems that are considered independent for data inte-
gration purposes. Schemata may contain both true “source” entities for which
the local database holds data and logical entities over which local queries can be
executed as well, but for which it is the data integration system’s task to gather
and provide mediated data from other information systems.
Component information systems may be structurally heterogeneous. In order
to make integration possible, the overall information infrastructure of Figure 4.1
is assumed to have a “global” data model, query language, and format for com-
municating data (results to queries). Component information systems may each
differ in their choices of such structural factors.
A model management repository is part of the data integration architecture.
It stores “proxies”, copies of each schema in an information system in the infras-
tructure, as a first-class object subject to manipulation in the repository. These
proxy schemata are of course expressed in the global data model1 used in the
1
In this thesis, the relational data model will occupy this role.


Figure 4.1: Reference Architecture. The repository side holds the proxy schemata and mappings together with an editor and the mediator services (query rewriting, physical plan generation, and query plan execution); the information systems side shows the component information systems with their mediator proxies and local query facilities, connected to the repository through schema and data translation.

repository. Mappings (as sets of symmetric inter-schema constraints) are stored
in the repository and accessed by the data integration reasoning services (which
will be referred to as the mediator in the tradition of [JLVV00]). The reasoning
services are assumed to have been implemented only once, “globally”, for the
“global” data model and query language. Locally, inside the information sys-
tems, there are mediator “proxies”, which accept queries using the local query
language, relative to the schema over the local data model, but delegate their
answering, after translation to global data model and query language, to the me-
diator. Mediated queries can be issued either inside an information system using
the local data model and query language or directly against the global mediator.
The most common vehicles of structural integration used throughout the data
integration approaches of Chapter 3 are wrappers [GMPQ+ 97, RS97, GK94]. The
use of wrappers is appropriate for the structural integration of (legacy) informa-
tion systems that act as sources to some global information system only. The
metaphor of wrappers is insufficient in architectures with several heterogeneous
information systems that each may need access to integrated data. We propose
a different (and bi-directional) mechanism for structural integration, which may
be conceptualized in analogy with the cell membranes of living organisms. In our
context, heterogeneous information systems each are enclosed by some transla-
tion membrane, which transforms incoming queries and data from the global data
model and query language to the local one, and does the opposite for outgoing
queries, data, and schema information2 . If the structural design choices of some
component information system have been the same as those of the global data in-
tegration infrastructure, such a membrane is of course not needed. In the case of
component information systems that do not need to access integrated data from
other information systems, one may revert to the simpler wrapping approach.
2
Information that may be on its way into the model management repository.

4.2 Mediating a Query
In general, queries are answered as follows. Initially, a query Q is issued against
one of the mediator proxies inside a component information system IS. This
query is then sent to an instance of the mediator. When crossing the boundary
of IS, Q is translated into a query Q′ in the “global” query language over the
proxy schema of IS, which is a citizen of the model management repository.
The mediator first rewrites Q′ into a query Q′′ over source predicates only,
using schema information and inter-schema constraints from the repository. This
query is then decomposed into an executable (distributed) query plan, which may
be optimized using cost-based metrics and special evaluation techniques known
from the distributed database field [MY95, OV99, Ull89]. To execute Q′′ , the
queries over component databases specified in the distributed query plan are sent
off to the individual information systems containing those databases.
While traversing the translation membrane surrounding component informa-
tion systems, the queries Qi are translated into queries Q′i over the local query
languages and modified to use the schemata over the local data models. These
queries are then passed on to the local query facilities, which execute them and
return data in formats relative to the local data models.
On the way back “out” of the component information systems and to the me-
diator, the data are translated to correspond to their schemata over the “global”
data model and are passed on to the mediator. There the data are combined
into one consistent result for Q′′ . This is then passed on to IS. On the way
through the component information system’s membrane, the result is reformu-
lated to adhere to Q and to the local data model of that component information
system.

4.3 Research Issues
The following chapters will address the two main voids left in our proposed approach, which are query rewriting and the management of mappings under change. Although much of what we have discussed in this chapter relates to structural integration (this was done to cover it here, so that we can subsequently focus on semantic integration), the problems related to it have been seen before
and are sufficiently well understood [GMPQ+ 97, RS97, GK94]. Similarly, dis-
tributed query execution is quite well understood once a logical query plan exists
[MY95, OV99, Ull89].
Data integration encompasses various aspects of data reconciliation that we
will, as simplifying assumptions, assume to be implicit in the query rewriting
problem or simply excluded from consideration. For instance, object identification
[JLVV00] is the issue of matching objects from different databases which may be
identified by keys from distinct domains, or which may have no keys at all. This

problem has spurred some research of its own (e.g. [ZHKF95b]), but to a degree
such problems may be dealt with in our framework, as shown in the example in
Section 1.3. Another argument in favor of this stance also applies to a related data
reconciliation problem, data cleaning [JLVV00]. In fact, many of the intricacies of these problems are related to mismatched, erroneous data, inconsistencies that often arise in the context of manually acquired data. However, in our high-energy
physics use case of Section 1.3, for example, such data are rare. Data are usually
also well-identified by cleanly thought-through domains of identifiers.
The rationale behind the first main contribution, query rewriting with sym-
metric inter-schema constraints (Chapter 5), on the other hand, is the following.
Expressive constraints are required for two reasons.

• The need to deal with concept mismatch which results from schemata being
integrated against others that have not been conceived for data integration,
and which may be a consequence of schema evolution.

• The need for flexibility that makes it possible to anticipate future change of schemata and requirements in the design of mappings. This includes the need for expressiveness that allows mappings to be prepared for the merging of schemata, and local-as-view integration to be emulated even when sources cannot be declared as views over the logical entities of the integration schemata.

The information infrastructure that has been outlined in this chapter can be
seen from a federated database perspective. There are several databases that
have design autonomy for their schemata (as well as for data models and query
languages), and each need to share data. As is well known for federated databases,
the lack of a “global” schema for data integration leads to the uncomfortable
situation that given N schemata, N² mappings between them need to be created
and managed. Given our requirement that schemata and integration requirements
may change, it is clear that the management task is difficult.
A surprising breakthrough on the management front is not to be expected.
Similar issues have been studied in various contexts by a large number of re-
searchers in the fields of software engineering, database (schema) design, and on-
tological engineering. The solutions that have been developed all center around
common ideas: the treatment of the artifacts to be managed as first-class cit-
izens on which clearly defined and powerful operations are developed that can
be used to manipulate them with the greatest possible amount of automation
and computer support, as well as the use of design patterns, best practices, and
design heuristics. We thus propose exactly such a solution, a model management
approach in combination with a methodology for managing mappings and their
change (Chapter 6).
Chapter 5

Query Rewriting with Symmetric
Constraints

5.1 Outline
In this chapter, we address the query rewriting problem of data integration in a
very general setting. To start somewhere, we take the common approach of re-
stricting ourselves to the relational data model and conjunctive queries. We drop
the assumption of the existence of a single coherent global integration model over
which queries may be asked, which are then rewritten into queries in terms of
source predicates. Given a conjunctive (or positive) relational query over (possi-
bly) both virtual and source predicates, we attempt to find a maximally contained
rewriting in terms of only source predicates under a given appropriate semantics
and a set of constraints, and the positive queries as a query language (i.e., the
output is a set of conjunctive queries). We support symmetric constraints in the
form of what we call Conjunctive Inclusion Dependencies (cind’s), containment
relationships between conjunctive queries.
We propose two alternative justifiable semantics, the classical logical and a
straightforward rewrite systems semantics 1 . Under both, the problem is a proper
generalization of the local-as-view as well as the global-as-view approaches.
In many real-life situations where neither source relations can be defined as
views over a given set of virtual relations nor a virtual relation as a view over a
number of sources, a satisfactory containment relationship between conjunctive
queries can be formulated using cind’s. Apart from that, our type of constraints makes it possible to map schemata in a model management context using a clean and expressive semantics or to “patch” local-as-view or global-as-view integration systems
1
Informally speaking, the intuition of this second semantics is that given a conjunctive query
Q, a subexpression E of Q, and a cind Q1 ⊇ Q2 , if we can produce a contained rewriting under
the semantics of the problem of answering queries using views where we take E as query and
Q1 as logical view, we can replace (while applying the respective variable mappings) E in Q by
Q2 to produce a rewriting that is again “contained” in Q.


when sources need to be integrated whose particularities were not foreseen when
the integration schemata were designed. The problem may also be relevant for
maintaining physical data independence under schema evolution (see Section 7.1).
Unfortunately, as is immediately clear for the classical semantics, such positive
rewritings may be infinite, and the major decision problems (such as the
nonemptiness or boundedness of the result) are undecidable. However, if the
predicate dependency graph (with respect to the inclusion direction) of a set of
constraints is acyclic, we can guarantee that the maximally contained rewritings
under both semantics are finite and can be computed. We will argue that for
obtaining maximally contained rewritings in the data integration context, the
constraints can be required to be acyclic without much inconvenience; indeed,
acyclicity may even be desirable.
As contributions of this chapter, we first provide characterizations of both
semantics as well as algorithms which, given a conjunctive query, enumerate the
maximally contained rewritings. We discuss various relevant aspects of query
rewriting in our context, such as the minimality and nonredundancy of conjunc-
tive queries in the rewritings. Next we compare the two semantics and argue that
the second is more intuitive and may better match the expectations of human users
of data integration systems. Following the philosophy of that semantics,
rewritings can be computed using established database techniques such as query
optimization and ideas from algorithms developed for the problem of answering
queries using views. We believe that practical information integration settings
exhibit certain regularities – predicates are grouped into schemata and tend to be
used together in queries, while few queries combine predicates from several
schemata – that render query rewriting following the intuitions of the second
semantics more efficient in practice. Surprisingly, however, it can be shown that
the two semantics coincide. We then present a scalable algorithm for the rewrite
systems semantics (based on previous work such as [PL00]), which we have
implemented in a practical system, CindRew. We evaluate it experimentally against
other algorithms for the same and for the classical logical semantics. It turns
out that our implementation, which we make available for download, scales to
thousands of constraints and to realistic applications. We conclude with a
discussion of how our query rewriting approach fits into state-of-the-art data
integration systems.

5.2 Preliminaries
We define a conjunctive inclusion dependency (cind) as a constraint of the form
Q1 ⊆ Q2, where Q1, Q2 are conjunctive queries (without arithmetic comparisons,
but possibly with constants) of the form

{⟨x1, …, xn⟩ | ∃xn+1 … xm : p1(X̄1) ∧ … ∧ pk(X̄k)}

with a set of distinct² unbound variables x1, …, xn. We may write {Q1 ≡ Q2}
as a short form of {Q1 ⊆ Q2, Q1 ⊇ Q2}.
The normalization of a set Σ of cind’s is a set of Horn clauses, the set of
cind’s taken as a logical formula transformed into (implication) normal form.
These Horn clauses are of a simple pattern. Every cind σ of the form Q1 ⊆ Q2
with

Q1 = {⟨x1, …, xn⟩ | ∃xn+1 … xm : v1(X̄1) ∧ … ∧ vk(X̄k)}
Q2 = {⟨y1, …, yn⟩ | ∃yn+1 … ym′ : p1(Ȳ1) ∧ … ∧ pk′(Ȳk′)}

translates to k′ Horn clauses pi(Z̄i) ← v1(X̄1) ∧ … ∧ vk(X̄k), where each zi,j
of Z̄i is determined as follows: If zi,j is a variable yh with 1 ≤ h ≤ n, replace it
with xh. If zi,j is a variable yh with n < h ≤ m′, replace it with the Skolem function
fσ,yh(x1, …, xn) (the subscript ensures that the Skolem functions are unique for
a given constraint and variable).

Example 5.2.1 The normalization of the cind
σ : {⟨y1, y2⟩ | ∃y3 : p1(y1, y3) ∧ p2(y3, y2)} ⊇ {⟨x1, x2⟩ | ∃x3 : v1(x1, x2) ∧ v2(x1, x3)}
is
p1(x1, fσ,y3(x1, x2)) ← v1(x1, x2) ∧ v2(x1, x3).
p2(fσ,y3(x1, x2), x2) ← v1(x1, x2) ∧ v2(x1, x3).
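To make the construction concrete, here is a minimal Python sketch of the normalization; the representation of queries and terms, and all names, are illustrative assumptions and not the thesis' CindRew implementation.

```python
# Minimal sketch (assumed representation) of normalizing a cind Q1 <= Q2 into
# Horn clauses.  A query is (distinguished_vars, body); body atoms are
# (predicate, [terms]); existential variables of the subsumer start with '?'.

def normalize_cind(sigma_id, subsumed, subsumer):
    """Return one Horn clause (head_atom, body_atoms) per atom of the subsumer's body."""
    x_dist, q1_body = subsumed        # Q1: distinguished variables x1..xn and body
    y_dist, q2_body = subsumer        # Q2: distinguished variables y1..yn and body
    rename = dict(zip(y_dist, x_dist))        # yh -> xh for 1 <= h <= n

    def translate(term):
        if term in rename:                    # distinguished variable of Q2
            return rename[term]
        if isinstance(term, str) and term.startswith('?'):
            # existential variable of Q2: Skolem term f_{sigma, var}(x1, ..., xn)
            return ('f_%s_%s' % (sigma_id, term[1:]), list(x_dist))
        return term                           # constant

    return [((pred, [translate(t) for t in args]), list(q1_body))
            for pred, args in q2_body]

# Example 5.2.1: Q2 has distinguished y1, y2 and existential ?y3
q2 = (['y1', 'y2'], [('p1', ['y1', '?y3']), ('p2', ['?y3', 'y2'])])
q1 = (['x1', 'x2'], [('v1', ['x1', 'x2']), ('v2', ['x1', 'x3'])])
for head, body in normalize_cind('sigma', q1, q2):
    print(head, '<-', body)
```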



Whenever a cind translates into a function-free clause in normal form, we will
write it in datalog notation. This is the case for cind’s of the form

{⟨X̄⟩ | p(X̄)} ⊇ Q

i.e. the subsumer query is a ∃-free single-literal query.
The dependency graph of a set C of Horn clauses is the directed graph constructed
by taking the predicates of C as nodes and adding, for each clause in C, an edge
from each of the body predicates to the head predicate. The diameter of a directed
acyclic graph is the length of the longest directed path occurring in it. The
dependency graph of a set of cind's is the dependency graph of its normalization.
A set of cind's is cyclic if its dependency graph is cyclic. An acyclic set Σ of
cind's is called layered if the predicates appearing in Σ can be partitioned into
n disjoint sets P1, …, Pn such that for each cind σ : Q1 ⊆ Q2 ∈ Σ there is an
index i with Preds(Body(Q1)) ⊆ Pi and Preds(Body(Q2)) ⊆ Pi+1, and Sources = P1.
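Acyclicity of this dependency graph will matter repeatedly below; the following small sketch (assumed clause representation, illustrative only) builds the graph of a normalization and tests it for cycles.

```python
# Sketch: build the dependency graph of normalized Horn clauses and test acyclicity.
from collections import defaultdict

def dependency_graph(clauses):
    """clauses: iterable of (head_pred, [body_preds]); edges go body -> head."""
    edges = defaultdict(set)
    for head, body in clauses:
        for b in body:
            edges[b].add(head)
    return edges

def is_acyclic(edges):
    # Depth-first search with colouring to detect a directed cycle.
    WHITE, GREY, BLACK = 0, 1, 2
    colour = defaultdict(int)

    def visit(node):
        colour[node] = GREY
        for succ in edges.get(node, ()):
            if colour[succ] == GREY:
                return False                 # back edge: cycle found
            if colour[succ] == WHITE and not visit(succ):
                return False
        colour[node] = BLACK
        return True

    return all(colour[n] == BLACK or visit(n) for n in list(edges))

# Example 5.2.1 normalizes to clauses with heads p1, p2 and bodies over {v1, v2}:
clauses = [('p1', ['v1', 'v2']), ('p2', ['v1', 'v2'])]
print(is_acyclic(dependency_graph(clauses)))   # True
```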
The problem that we want to address in this chapter is the following:
² Note that if we did not require unbound variables in constituent queries to be distinct, the transformation into normal form would result in Horn clauses with equality atoms as heads.

Definition 5.2.2 (Query rewriting under symmetric constraints.) Given disjoint
sets of so-called “source” (materialized) and “virtual” predicates, a conjunctive
(or positive) query Q over possibly both sources and virtual predicates, and a
set Σ of cind’s, find the maximally contained positive query Q′ exclusively over
source predicates under a given semantics. 

Later in this chapter we will discuss two such semantics for this problem. The
maximally contained rewritings under these semantics will be defined analogously
to the case of the problem of answering queries using views. Note that we do not
require that the input query Q only contains virtual predicates; furthermore, we
do not by default have any special restrictions regarding a set of cind’s Σ, apart
from the following. Without loss of generality, and for simplicity, we assume that
no source predicates appear in any heads of Horn clauses created by normaliza-
tion of the cind’s. (We can always replace a source predicate that violates this
assumption by a new virtual predicate in all cind’s and then add a cind that
maps the source predicate to that new virtual predicate.)

5.3 Semantics
We discuss two alternative semantics for query rewriting: first the classical
logical semantics, and then a straightforward rewrite systems semantics.

5.3.1 The Classical Semantics
Let us begin with a straightforward remark on the containment problem for
conjunctive queries under a set of cind’s, which, since they are themselves con-
tainment relationships between conjunctive queries, is the implication problem
for this type of constraint. If we want to check a containment

{⟨X̄⟩ | ∃Ȳ : φ(X̄, Ȳ)} ⊇ {⟨X̄⟩ | ∃Z̄ : ψ(X̄, Z̄)}

of two conjunctive queries under a set Σ of cind's by refutation (without loss of
generality, we assume Ȳ and Z̄ to be disjoint and the unbound variables in the
two queries above to be the same³, X̄), we have to show

Σ, ¬(∀X̄ : (∃Ȳ : φ(X̄, Ȳ)) ← (∃Z̄ : ψ(X̄, Z̄))) ⊨ ⊥

i.e. the inconsistency of the constraints and the negation of the containment taken
together. In normal form, ψ becomes a set of ground facts where all variables
³ In the remainder of this chapter, we will implicitly – whenever we do not sacrifice clarity by this – assume that variables from different clauses are distinct, or in different "name spaces", even if several instances of the same clause interfere with each other during unification or unfolding, and that new variables are automatically introduced where necessary to assure this.

have been replaced one-to-one by new constants and φ becomes a clause with
an empty head, where all distinguished variables xi have been replaced by the
constants also used for ψ.

Example 5.3.1 For proving the containment

{⟨x1, x2⟩ | ∃x3 : p1(x1, x3) ∧ p2(x3, x2)} ⊇ {⟨y1, y2⟩ | ∃y3 : r1(y1, y3) ∧ r2(y3, y2)}

we have to translate it into

← p1(α1, x3) ∧ p2(x3, α2).

r1(α1, α3) ←.        r2(α3, α2) ←.
where α1 , α2 , α3 are constants not appearing elsewhere. 

We have now transformed our original problem into a set of equivalent Horn
clauses, and can treat it as a logic program. We can take the single clause with
the empty head above (the body of the subsumer query) and use it as a goal for
refutation.

Definition 5.3.2 Under the classical semantics, a maximally contained rewrit-
ing of a conjunctive query Q is equivalent to the set of all conjunctive queries Q′
over source predicates for which Σ ⊨ Q′ ⊆ Q. 

We can obtain such a maximally contained rewriting in the following way.
Given a conjunctive query Q⁴ and the normalization C of a set of cind's, we add
a unit clause (with a tuple of distinct variables) for each source atom⁵. Then we
try to refute the body of Q. (Differently from what we do for containment, we do
not freeze any variables.) If we have found a refutation with a most general unifier
θ, we collect the unit clauses used and create a Horn clause with θ(Head(Q)) as
head and the application of θ to the instances of unit clauses involved in the proof
as body. If this clause is function-free, we output it; after that, we continue as if
we had not found a "proof", in order to compute more rewritings. Given e.g. a
breadth-first strategy, it is easy to see that this method computes a maximally
contained rewriting of Q in terms of multisets of conjunctive queries, in the sense
that for each conjunctive query contained in Q, a subsumer will eventually be
produced. See Example 5.3.10 for query rewriting by an altered refutation proof.
Equivalent rewritings can be computed by interleaving the computation of
contained rewritings with checking whether Q is contained in any of the already
computed rewritings.
⁴ The results in this chapter generalize to positive input queries in a straightforward manner.
⁵ We still assume that source predicates do not appear in any heads in C.

Unfortunately, since we allow for arbitrary conjunctive queries as subsumees
in cind’s, we cannot make any guarantees regarding the minimality or nonredun-
dancy of rewritings. While it is of course possible to minimize conjunctive queries
when they are produced, it is impractical to require that the result be nonredun-
dant. It can for instance easily be seen that we can encode arbitrary (recursive)
datalog programs as sets of cind’s. Query rewriting may then produce an infinite
result, and the boundedness problem (that is, telling whether the result will be
finite) is undecidable. Thus, if an incomplete result is acceptable in such cases,
it is more appropriate to output rewritings as soon as they are found, and not to
eliminate redundancies.
We next present an alternative algorithm for computing maximally contained
rewritings which proceeds in a bottom-up fashion. The intuition of this procedure
can be used to unfold constraints early on where appropriate, which may allow
us to avoid recomputing certain intermediate results many times. It also only
needs a restricted kind of unification that we want to look at in more detail.

Algorithm 5.3.3 (Bottom-up query rewriting).
Input: The normalization C of a set of cind's that do not contain source predicates
in the subsuming query, a conjunctive query Q, and a set of source predicates S.
Output: A (multi-)set of conjunctive queries X exclusively over source predicates.

X := {c ∈ C | Preds(Body(c)) ⊆ S};
C := C \ X;
forever {
    choose some clause c ∈ C ∪ {Q};
    let n = |Body(c)|;  θ := ∅;
    choose some tuple ⟨c1, …, cn⟩ with c1, …, cn ∈ X ∪ {ε},
        such that (ci = ε) iff Pred(Bodyi(c)) ∈ S;
    for each 1 ≤ i ≤ n with ci ≠ ε do
        θ := unify(θ, Bodyi(c), Head(ci));
    if θ ≠ fail then {
        c′ := unfold(c, θ, ⟨c1, …, cn⟩);
        if (c ≠ Q) ∧ (Body(c′) is function-free) then
            X := X ∪ {c′};
        else if (c = Q) ∧ (c′ is function-free) then
            print c′;
    }
    if no new query or clause for X can be found then
        exit;
}


We will now have a closer look at the functions "unify" and "unfold" which we
have used above and which we will meet again in this form later. "unify" takes a
most general unifier θ and two atoms a and b and produces a most general unifier
θ′ of a and b that is consistent with θ in the usual way, if one exists. Otherwise
the function returns fail (we assume the variables in the two atoms to be from
two distinct name spaces). We assume that a is always from the body of a clause
"higher up" and that b is the head of a clause whose body is to replace that
former atom.
Unification here is simpler than in general because we have the following
restrictions: (1) Body(c) is always function-free, which simplifies the
implementation of unification. (2) Since the body of each valid query must be
function-free, and since once a clause contains a function term in its body it
cannot recover from that state, we can exclude the possibility that a function
term from Head(ci) gets unified with a variable from Head(cj). For the same
reason, we can disallow that function terms get unified with variables that
appear in atoms a ∈ Body(c) where Pred(a) is a source. Furthermore, when two
function terms from Head(ci), Head(cj) get unified with the same variable in c,
they must be equal up to variable renaming, because otherwise subterms would
again get unified with variables from some ck. (3) If a variable from c gets
unified with a function term, it cannot be unified with any other variable.
(4) If c is a query to be rewritten, we can block all variables in Head(c) early
on from being unified with function terms, as this could not lead to a
function-free rewriting either.
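As an illustration of restriction (4), the following sketch shows plain syntactic unification extended with a set of "blocked" variables that must never be bound to a function term. The term representation and names are assumptions; a full implementation would also have to propagate blocking through variable-to-variable bindings.

```python
# Sketch: terms are variables ('?x'), constants (plain strings), or function
# terms ('f', [args]).  Blocked variables (e.g. the head variables of a query)
# may never be bound to a function term.

def walk(t, s):
    while isinstance(t, str) and t.startswith('?') and t in s:
        t = s[t]
    return t

def occurs(v, t, s):
    t = walk(t, s)
    if t == v:
        return True
    if isinstance(t, tuple):
        return any(occurs(v, a, s) for a in t[1])
    return False

def unify(a, b, s, blocked=frozenset()):
    """Extend substitution s (a dict) so that a and b become equal, or return None."""
    a, b = walk(a, s), walk(b, s)
    if a == b:
        return s
    if isinstance(a, str) and a.startswith('?'):
        if isinstance(b, tuple) and (a in blocked or occurs(a, b, s)):
            return None        # would bind a blocked variable to a function term
        return {**s, a: b}
    if isinstance(b, str) and b.startswith('?'):
        return unify(b, a, s, blocked)
    if isinstance(a, tuple) and isinstance(b, tuple) and a[0] == b[0] and len(a[1]) == len(b[1]):
        for x, y in zip(a[1], b[1]):
            s = unify(x, y, s, blocked)
            if s is None:
                return None
        return s
    return None                # constant clash or arity mismatch

# Unifying b(?x, ?y) with a head b(f(?z), ?z) fails if ?x is blocked:
print(unify(('b', ['?x', '?y']), ('b', [('f', ['?z']), '?z']), {}, blocked={'?x'}))  # None
print(unify(('b', ['?x', '?y']), ('b', [('f', ['?z']), '?z']), {}))  # binds ?x and ?y
```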
The function "unfold" accepts a Horn clause c with |Body(c)| = n, a unifier θ,
and a tuple of n Horn clauses or ε such that if ci ≠ ε, θ unifies Bodyi(c) with
Head(ci). It produces a new Horn clause c′ from c by replacing each of its
non-source body atoms Bodyi(c) with ci ≠ ε by θ(Body(ci)), i.e., after applying
the substitutions from the unifier. If ci = ε, then Bodyi(c′) = θ(Bodyi(c)).
If the clauses c1 , . . . , cn are from the normalization of a set of cind’s rather
than the unfolding of constraints (as produced by Algorithm 5.3.3), we may
avoid producing redundancies in the result by not including substituted bodies if
already another body from the same cind was included and this occurred under
the same substitution of all distinguished variables of that cind. A special case⁶
that is particularly easy to implement is when a variable of c has been unified
with a function term. In that case, only one body atom that contains this variable
needs to be substituted, all others can be dropped. This is the case because the
normalization of a cind will only produce function terms that contain all the
distinguished variables of the cind in a uniform manner. Therefore, when the
unification of a variable from c with two function terms with the same function
symbol succeeds, all the variables in a pair of function terms unified with the

⁶ This case is analogous to a technique that is part of the MiniCon algorithm [PL00], which allows one to restrict oneself to including a view in a rewriting only once for each application of a MiniCon description.

same variable of c have been pairwise unified themselves.

5.3.2 The Rewrite Systems Semantics
The rewrite systems semantics is best defined using the notion of MiniCon de-
scriptions (see Definition 3.6.3). We adapt this notion to our framework based
on rewriting with Horn clauses.

Definition 5.3.4 (Inverse MiniCon Description). Let Q be a conjunctive query
with n = |Body(Q)| and let C be the normalization of a set of cind's. An (inverse)
MiniCon description for Q is a tuple ⟨c1, …, cn⟩ ∈ (C ∪ {ε})ⁿ that satisfies the
following two conditions. (1) For the most general unifier θ ≠ fail arrived at by
unifying all the ci ≠ ε with Bodyi(Q), the unfolding of Q and ⟨c1, …, cn⟩ under
θ is function-free, and (2) there is no tuple ⟨c′1, …, c′n⟩ ∈ {c1, ε} × … × {cn, ε}
with fewer entries different from ε than in ⟨c1, …, cn⟩ such that the unfolding of
Q with ⟨c′1, …, c′n⟩ is function-free. 

Note that the inverse MiniCon descriptions of Definition 5.3.4 exactly coincide
with the MCDs of Definition 3.6.3. The algorithm for computing maximally
contained rewritings shown below can easily be reformulated so as to use the
standard MCDs of [PL00]. That way, one can even escape the need to transform
cind’s into Horn clauses and can reason completely without the introduction of
function terms. However, to support the presentation of our results (particularly
the equivalence proof of the following section), we do not follow this path in this
chapter.
Maximally contained rewritings of a conjunctive query Q are now computed by
iteratively unfolding queries with single MiniCon descriptions⁷ until a rewriting
contains only source predicates in its body.

Algorithm 5.3.5 (Query rewriting with MCDs).
Input: A conjunctive query Q, the normalization C of a set of cind's, and a set
S of source predicates.
Output: A maximally contained rewriting of Q.

Qs := [Q];
while Qs is not empty do {
    [Q, Qs] := Qs;
    if Preds(Q) ⊆ S then output Q;
    else {
        M := compute inverse MCDs for Q, C;
        for each ⟨c1, …, cn⟩ ∈ M do {
            θ := ∅;
            for each 1 ≤ i ≤ n do
                θ := unify(θ, Bodyi(Q), ci);
            Q′ := unfold(Q, θ, ⟨c1, …, cn⟩);
            Qs := [Qs, Q′];
        } } }

⁷ In this respect, the rewrite systems semantics differs from the MiniCon algorithm for the problem of answering queries using views.

“unify” is the restricted kind of unification that we discussed in the previous
section, with the additional constraint that now all function terms are of depth
one (that is, there are no function terms that have function terms as subterms).
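The overall control structure of Algorithm 5.3.5 is a simple breadth-first worklist. The following skeleton is purely illustrative; the MCD computation and the unfolding step are abstracted behind a caller-supplied expand function, and the toy rule set is an assumption, not CindRew's actual code.

```python
# Skeleton of the worklist loop of Algorithm 5.3.5.
from collections import deque

def rewrite(query, source_preds, expand, preds_of):
    """Yield rewritings whose predicates are all source predicates.
    expand(q) returns the queries obtained by unfolding q with one inverse MCD."""
    worklist = deque([query])
    while worklist:
        q = worklist.popleft()               # breadth-first, as in the algorithm
        if preds_of(q) <= source_preds:
            yield q
        else:
            worklist.extend(expand(q))

# Toy usage: queries as frozensets of predicate names, one "rule" per predicate.
rules = {'a': [frozenset({'s1', 's2'})], 'b': [frozenset({'s3'}), frozenset({'c'})]}

def expand(q):
    # a real implementation unfolds with inverse MCDs and unifies variables
    for p in q:
        for body in rules.get(p, []):
            yield (q - {p}) | body

# Duplicated outputs correspond to applying the same unfoldings in different orders.
print(list(rewrite(frozenset({'a', 'b'}), {'s1', 's2', 's3'}, expand, lambda q: q)))
```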
Definition 5.3.6 (Rewrite Systems Semantics). Let Q be a conjunctive query,
S a set of source predicates, and Σ a set of cind’s. Then, Algorithm 5.3.5 computes
the maximally contained positive rewriting of Q under Σ in terms of S under the
rewrite systems semantics. 
Example 5.3.7 ("Coffee Can Problem" [DJ90]) Consider the rewrite system

black white → black        white black → black        black black → white

with symbols "white" and "black" and the input word

w = (white white black black white white black black)
where the goal is to replace sequences of symbols of that word that match the
left hand side of one of the three productions listed above repeatedly to produce
a rewriting that is as small as possible. One such sequence of replacements is
(0) white white black black white white black black
(1) white white black black white black black
(2) white white white white black black
(3) white white white black black
(4) white white black black
(5) white black black
(6) black black
(7) white
In each step, an adjacent pair of occurrences of the symbols "black" and "white"
matching the left-hand side of a production is replaced by the corresponding
right-hand side. Thus, the input string can be rewritten into a word consisting
of the single symbol "white".
We can simulate this behavior using query rewriting under the rewrite systems
semantics. Let us search for one-symbol rewritings. We model an n-symbol word
w ∈ {black, white}ⁿ as a query of the form

q(x1) ← start_end(x1, xn+1), p1(x1, x2), …, pi(xi, xi+1), …, pn(xn, xn+1).

where each pi is either "black" or "white" and x1, …, xn+1 are variables. The above
input word is thus represented as

q(x1) ← start_end(x1, x9),
        white(x1, x2), white(x2, x3), black(x3, x4), black(x4, x5),
        white(x5, x6), white(x6, x7), black(x7, x8), black(x8, x9).

The rewrite system can be encoded as a set of cind's

{⟨x, y⟩ | ∃z : black(x, z) ∧ white(z, y)} ⊇ {⟨x, y⟩ | black(x, y)}
{⟨x, y⟩ | ∃z : white(x, z) ∧ black(z, y)} ⊇ {⟨x, y⟩ | black(x, y)}    (⋆)
{⟨x, y⟩ | ∃z : black(x, z) ∧ black(z, y)} ⊇ {⟨x, y⟩ | white(x, y)}

Furthermore, we define two source predicates w_src and b_src and add cind's
responsible for making the rewrite process terminate with "success" (i.e., a
contained rewriting in terms of the source predicates is found):

{⟨x⟩ | ∃y : start_end(x, y) ∧ white(x, y)} ⊇ {⟨x⟩ | w_src(x)}
{⟨x⟩ | ∃y : start_end(x, y) ∧ black(x, y)} ⊇ {⟨x⟩ | b_src(x)}
It can be verified by applying the above algorithm (although this is a quite
work-intensive task) that the maximally contained rewriting under the rewrite
systems semantics is

q(x1) ← w_src(x1).

In fact, the seven-step sequence of replacements shown above can easily be turned
into a proof, in our rewrite systems semantics, that this query is in the maximally
contained rewriting. For the first replacement of that sequence, the tuple
⟨c1, …, cn⟩ ∈ (C ∪ {ε})ⁿ of Algorithm 5.3.5 would equal ⟨ε, ε, ε, cσ2,1, cσ2,2, ε, ε, ε⟩,
where cσ2,1 and cσ2,2 are the first and second Horn clauses created by normalizing
our second cind (⋆). We can conclude that the above rewrite system cannot produce
a one-symbol rewriting "black" for the given input word. 
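Independently of the cind encoding, the rewrite system itself can be simulated by brute force. The following sketch (illustrative, not part of the thesis) explores all replacement orders and confirms that only the one-symbol word "white" is reachable from the input word.

```python
# Brute-force simulation of the coffee-can rewrite system of Example 5.3.7.
from functools import lru_cache

RULES = [(('black', 'white'), ('black',)),
         (('white', 'black'), ('black',)),
         (('black', 'black'), ('white',))]

@lru_cache(maxsize=None)
def reachable_one_symbol(word):
    """Return the set of one-symbol words derivable from `word` (a tuple of symbols)."""
    if len(word) == 1:
        return frozenset([word])
    out = set()
    for i in range(len(word) - 1):
        for lhs, rhs in RULES:
            if word[i:i + 2] == lhs:
                out |= reachable_one_symbol(word[:i] + rhs + word[i + 2:])
    return frozenset(out)

w = ('white', 'white', 'black', 'black', 'white', 'white', 'black', 'black')
print(reachable_one_symbol(w))   # only ('white',) is reachable; "black" is not
```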

5.3.3 Equivalence of the two Semantics
Theorem 5.3.8 Let Q be a conjunctive query, Σ be a set of cind’s, and S be a
set of “source” predicates. Then, the maximally contained rewriting under the
classical logical semantics and Σ in terms of S and its analog under the rewrite
systems semantics coincide. 

For showing this, we first establish the following auxiliary result.

Lemma 5.3.9 Let P be a resolution proof establishing a logically contained
rewriting of a conjunctive query Q under a set of cind’s Σ. Then, there is always
a proof P ′ establishing the same contained rewriting such that each intermediate
rewriting is function-free. 

Proof. Let us assume that each new subgoal a derived using resolution receives
an identifying index idx(a). Then, given the proof P, there is a unique next
premise to be applied cidx(a) out of the Horn clauses in the normalization of Σ for
each subgoal a. This is the Horn clause from our constraints base that will be
unfolded with a to resolve it in P.
Note that the proof P is fully described by the indexes of subgoals in the (body
of the) original query Q, some unique indexing of subgoals somewhere created
later on in the proof (while we do not need to know the atoms themselves), the
clauses cidx(a) , and which indexes the subgoals in the bodies of these clauses are
attributed with when they are unfolded with subgoals.
In our original proof P, each subgoal a of a goal is rewritten with cidx(a) in
each step, transforming g0 , the body of Q and the initial goal, via g1 , . . . , gn−1 to
gn , the body of the resulting rewriting. We maintain the head of Q separately
across resolution steps and require that variables in the head are not unified with
function terms, while applying the other unifications effected on the variables in
the goals in parallel with the rewriting process. P itself must already ensure at
every step that no variable from the head of Q is unified with a function term, as
otherwise no conjunctive query could result.
We know that resolution remains correct no matter in which order the next due
resolution steps cidx(a) are applied to the subgoals; given, e.g., a goal with two
atoms, we may even unfold the first subgoal and then a subgoal from the unfolding
of that first subgoal (any finite number of times) before we unfold our second
original subgoal.
Coming back to deriving a function-free proof starting from P, all we now
have to show is that at any intermediate step of a resolution proof with cind’s, a
nonempty set of subgoals S = {ai1 , . . . , aik } ⊆ gi of the function-free intermediate
goal gi exists such that, when only these subgoals are unfolded with their next due
premises to be applied cidx(ai1), …, cidx(aik), the overall new goal gi+1 produced
will be function-free⁸. The emphasis here lies on finding such a nonempty set
S, as the empty set automatically satisfies this condition. If we can guarantee
that such a nonempty set always exists until the function-free proof has been
completed, our lemma is shown.
Let there be a dependency graph Ggi = ⟨V, E⟩ for each intermediate goal gi, with
the subgoals as vertices and a directed edge ⟨a, b⟩ ∈ E iff a contains a variable v
that is unified with a function term f(X̄) in Head(cidx(a)) and v appears in b and
is unified with a variable (rather than a function term with the same function
symbol) in Head(cidx(b)). (Intuitively, if there is an edge ⟨a, b⟩ ∈ E, then
b must be resolved before a if a proof shall be obtained in which all intermediate
goals are function-free.) As mentioned, query heads are guaranteed to remain
function-free by the correctness of P. For instance, the dependency graph of the
goal

← a(x)(0), b(x, y)(1), c(y, z)(2), d(z, w)(3).

with

c0 : a(x) ← a′(x).        c1 : b(f(x), x) ← b′(x).
c2 : c(x, x) ← c′(x).        c3 : d(g(x), x) ← d′(x).

would be G = ⟨{0, 1, 2, 3}, {⟨0, 1⟩, ⟨2, 3⟩}⟩.

⁸ The correctness of the proof P alone assures that the query head will be function-free as well.
We can now show that such a dependency graph G is always acyclic. In fact,
if it were not, P could not be a valid proof, because unification would fail when
trying to unify a variable in such a cycle with a function term that contains that
variable. This is easy to see because, given our construction for obtaining Horn
clauses from cind's, each function term contains all variables appearing in the
same (head) atom. Consider for instance

q(x) ← a(x, y), a(y, z), b(w, z), b(z, y).

{⟨x, y⟩ | ∃z : a(x, z) ∧ a(z, y)} ⊇ {⟨x, y⟩ | b(x, y)}
{⟨x, y⟩ | ∃z : b(x, z) ∧ b(z, y)} ⊇ {⟨x, y⟩ | a(x, y)}

There is no rewriting under our two semantics, because the dependency graph of
our above construction is cyclic already for our initial goal, the body of q.
However, since G is acyclic, we can always unfold a nonempty set of atoms of the
current intermediate goal (those unreachable from other subgoals in the graph G)
until the proof has been completed. 
Proof of Theorem 5.3.8. It is easy to see that the rewriting process for
finding maximally contained rewritings under the rewrite systems semantics is
equivalent to resolution where only some of the subgoals of a goal may be rewrit-
ten in a single step and each intermediate rewriting has to be function-free.
Assume that a proof establishing a single contained conjunctive query is
known for the rewrite systems semantics. Then, this is also a proof for the
classical semantics, and inclusion in this direction is shown.
The other direction follows from Lemma 5.3.9. Given a resolution proof P
that a conjunctive query Q′ is a contained rewriting of Q, we can always construct
an analogous proof of this from P for the rewrite systems semantics.
From this equivalence of resolution proofs and proofs with function-free inter-
mediate steps we conclude that the overall search process for maximally contained
rewritings under both semantics is guaranteed to lead to equal results. 

Example 5.3.10 Given a boolean conjunctive query q ← b(x, x, 0). and the
following set of Horn clauses, which, as is easy to see, is the normalization of
(and hence equivalent to) a set of cind's that we do not show here in order to
reduce redundancy.

b(x′, y′, s0) ← a(x, y, s2) ∧ eε(x, x′) ∧ e1(y, y′).            c0
b(x′, y′, s2) ← a(x, y, s0) ∧ e1(x, x′) ∧ e0(y, y′).            c4, c10, c11
b(x′, y′, s0) ← a(x, y, s1) ∧ e0(x, x′) ∧ eε(y, y′).            c12, c18, c19
b(x′, y′, s1) ← a(x, y, s0) ∧ e1(x, x′) ∧ e1(y, y′).            c20, c25

eε(x, x) ← v(x).            c2, c17
e1(x, f1(x)) ← v(x).            c3, c8, c23, c24
e0(x, f0(x)) ← v(x).            c9, c16
v(x) ← b(x, y, s).            c5, c13, c21
v(y) ← b(x, y, s).            c6, c14
a(x, y, s) ← b(x, y, s).            c1, c7, c15
where x, y, x′ , y ′ are variables and s0 , s1 , s2 are constants. Let P be the resolution
proof

(0) ← b(x, x, 0)(0) .
(1) ← a(x, y, 2)(1) , eǫ (x, z)(2) , e1 (y, z)(3) .
(2) ← b(f1 (y), y, 2)(4), v(f1 (y))(5) , v(y)(6) .
(3) ← a(x1 , y1 , 0)(7) , e1 (x1 , f1 (y))(8) , e0 (y1 , y)(9) ,
b(f1 (y), v1, 2)(10) , b(v2 , y, 2)(11). †10 , †11
(4) ← b(f0 (y1 ), y1 , 0)(12) , v(f0 (y1 ))(13) , v(y1 )(14) .
(5) ← a(x2 , y2 , 1)(15) , e0 (x2 , f0 (y1 ))(16) ,
eǫ (y2 , y1)(17) , b(f0 (y1 ), v1 , 0)(18) ,
b(v2 , y1 , 0)(19) . †18 , †19
(6) ← b(y1 , y1 , 1)(20) , v(y1)(21) .
(7) ← a(x, x, 0)(22) , e1 (x, f1 (x))(23) ,
e1 (x, f1 (x))(24) , b(y1 , v1 , 1)(25) . †25
(8) ← a(x, x, 0)(22) , v(x)(26) .

which rewrites our query into q ← a(x, x, 0), v(x). and in which we have su-
perscribed each subgoal with its assigned index. To keep things short, we have
eliminated subgoals (marked with a dagger † and their index) that are redundant
with a different branch of the proof. As claimed in our theorem, P can be trans-
formed into the following proof in which each intermediate step is function-free.

(0) ← b(x, x, 0)(0) .
(1) ← a(x, y, 2)(1) , eǫ (x, z)(2) , [e1 (y, z)(3) ].
(2) ← b(x, y, 2)(4) , v(x)(5) , [e1 (y, x)(3) ].
(3) ← a(x1 , y1 , 0)(7) , e1 (x1 , x)(8) , e0 (y1 , y)(9) ,
b(x, v1 , 2)(10) , [e1 (y, x)(3) ]. †10
(4) ← a(x1 , y1 , 0)(7) , e1 (x1 , x)(8) ,
[e0 (y1 , y)(9) ], 6 [e1 (y, x)(3)6 ].
(5) ← b(y, y1, 0)(12) , v(y)(14) , [e0 (y1 , y)(9) ].

(6) ← a(x2 , y2 , 1)(15) , e0 (x2 , y)(16) , eǫ (y2 , y1 )(17) ,
b(y, v1 , 0)(18) , 6 [e0 (y1, y)(9)6 ]. †18
(7) ← b(y1 , y1 , 1)(20) , v(y1)(21) .
(8) ← a(x3 , y3 , 0)(22) , e1 (x3 , y1)(23) ,
e1 (y3, y1 )(24) , b(y1 , v1 , 1)(25) . †25
(9) ← a(x3 , x3 , 0)(22) , v(x3 )(26) .

The subgoals that we have marked with brackets [ ] had been blocked at a certain
step to keep the proof function-free. 

Of course this correspondence between function-free and general resolution
proofs does not hold for Horn clauses in general.

Example 5.3.11 Consider the boolean query

q ← a1 (u, v), b1(u, v).

and the Horn clauses

a1 (f (x), y) ← a2 (x, y). a2 (x, g(y)) ← a3 (x, y).

b1 (x, g(y)) ← b2 (x, y). b2 (f (x), y) ← b3 (x, y).
These entail

q ← a3 (x, y), b3 (x, y).

although one cannot arrive at a function-free intermediate rewriting by unfolding
the left subgoal of our query first (which would result in q ← a2(x, y), b1(f(x), y).),
by unfolding the right subgoal first (which would result in q ← a1(x, g(y)), b2(x, y).),
or by unfolding both at once (resulting in q ← a2(x, g(y)), b2(f(x), y).). 

5.3.4 Computability
Theorem 5.3.12 Let Σ be a set of cind’s and Q and Q′ be conjunctive queries.
Then the following problems are undecidable:

• Σ ⊨ Q ⊆ Q′, the containment problem.

• ∃Q′ : Σ ⊨ Q ⊇ Q′, i.e., it is undecidable whether the maximally contained
rewriting of a conjunctive query Q under the classical logical semantics is
nonempty (that is, whether it contains at least one conjunctive query)⁹. 
⁹ By Theorem 5.3.8, this is equivalent to the following problem: given a conjunctive query Q, is the maximally contained rewriting under the rewrite systems semantics nonempty?

We also give an intuition for the undecidability results of Theorem 5.3.12.
Post's Correspondence Problem (PCP, see e.g. [HU79]), a simple and well-known
undecidable problem, is defined as follows. Given nonempty words x1, …, xn and
y1, …, yn over the alphabet {0, 1}, the problem is to decide whether there are
indexes i1, …, ik (with k > 0) such that xi1 xi2 … xik = yi1 yi2 … yik.
In fact, Example 5.3.10 already presented an encoding of PCP that shows the
undecidability of query rewriting with cind's¹⁰. In the following example, we
provide another, simpler encoding.

Example 5.3.13 Given are a source s, a boolean query q ← inc(0, 0). and
the following five cind’s
{⟨x, y⟩ | dec(x, y)} ⊆ {⟨x, y⟩ | ∃x1, y1 : zero(x, x1) ∧ zero(y, y1) ∧ inc(x1, y1)}    (1)
{⟨x, y⟩ | dec(x, y)} ⊆ {⟨x, y⟩ | ∃x1, y1 : zero(x, x1) ∧ zero(y, y1) ∧ dec(x1, y1)}    (2)
{⟨x, y⟩ | dec(x, y)} ⊆ {⟨x, y⟩ | ∃x1, y1 : one(x, x1) ∧ one(y, y1) ∧ inc(x1, y1)}    (3)
{⟨x, y⟩ | dec(x, y)} ⊆ {⟨x, y⟩ | ∃x1, y1 : one(x, x1) ∧ one(y, y1) ∧ dec(x1, y1)}    (4)
dec(0, 0) ← s.    (5)
that constitute the core encoding and two constraints

inc(x, y) ← one(x, x1 ), zero(x1 , x2 ), one(x2 , x3 ),
one(y, y1), inc(x3 , y1 ). (6)

inc(x, y) ← one(x, x1 ), zero(y, y1), one(y1 , y2),
one(y2 , y3 ), one(y3, y4 ), zero(y4 , y5 ),
inc(x1 , y5 ). (7)
that stand for a PCP problem instance with two pairs of words,

I = {⟨x1 = 101, y1 = 1⟩, ⟨x2 = 1, y2 = 01110⟩}

The constraints (1) – (4) can be considered to have a role of “guessing” a solu-
tion to the PCP problem, constraints (6) and (7) to have a role of “checking”
the solution, and constraint (5) the role of “terminating” when the search was
successful.
¹⁰ Example 5.3.10 is an encoding of PCP with the instance I = {⟨x1 = 10, y1 = 1⟩, ⟨x2 = 1, y2 = 01⟩}. The instance itself is encoded in the first four Horn clauses only. The encoding, while more complicated than the one presented in this section, shows the undecidability of query rewriting (a PCP instance is satisfiable if and only if the maximally contained rewriting of the query q ← b(x, x, 0). is nonempty) as well as the undecidability of query containment under a set of cind's (a PCP instance is satisfiable iff {⟨⟩ | ∃x : v(x) ∧ a(x, x, 0)} ⊆ {⟨⟩ | ∃x : b(x, x, 0)}).

To show that the PCP instance is satisfiable, one can compute a contained rewriting
by applying the constraints in the following order (we describe only the successful
proof, not the dead-end branches): (guess phase) (6), (7), (6), (check phase) (3),
(2), (4), (4), (4), (2), (4), (termination) (5). The maximally contained rewriting
is nonempty because there is a solution to this particular PCP instance,
x1 x2 x1 = y1 y2 y1 = 1011101. 
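The claimed solution can be checked directly; the following few lines of plain Python (independent of the cind encoding) verify the index sequence 1, 2, 1 for this instance.

```python
# Check the PCP instance I = {(x1=101, y1=1), (x2=1, y2=01110)} of Example 5.3.13
# against the index sequence 1, 2, 1.
pairs = {1: ('101', '1'), 2: ('1', '01110')}
indexes = [1, 2, 1]
top = ''.join(pairs[i][0] for i in indexes)
bottom = ''.join(pairs[i][1] for i in indexes)
print(top, bottom, top == bottom)   # 1011101 1011101 True
```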

5.3.5 Complexity of the Acyclic Case
For the important case that Σ is acyclic, the two problems above are decidable
(and NEXPTIME-complete). We first establish the following auxiliary result.

Lemma 5.3.14 Let Σ be an acyclic set of cind's and Q and Q′ be conjunctive
queries. Then the containment problem Σ ⊨ Q ⊆ Q′ and the problem of deciding
whether the maximally contained rewriting of Q (as a set of conjunctive queries)
is nonempty are NEXPTIME-hard. 

Proof. NEXPTIME-hardness follows from a slightly altered form of the
encoding of the NEXPTIME-complete Tiling problem (see e.g. [Pap94]) used in
[DV97] to show NEXPTIME-hardness of the SUCCESS problem for nonrecursive
logic programming.
TILING is the problem of tiling the square of size 2ⁿ × 2ⁿ by tiles – squares of
size 1 × 1 – of k types. There are two binary relations on and to defined on the
tiles. Tiles ti and tj are said to be horizontally compatible if ⟨ti, tj⟩ ∈ to holds
and vertically compatible if ⟨ti, tj⟩ ∈ on. A tiling of the square of size
2ⁿ × 2ⁿ is a function f : {1, …, 2ⁿ} × {1, …, 2ⁿ} → {t1, …, tk} such that vertically
and horizontally neighboring tiles are compatible, i.e.

⟨f(i, j), f(i + 1, j)⟩ ∈ to    for all 1 ≤ i < 2ⁿ, 1 ≤ j ≤ 2ⁿ

and

⟨f(i, j), f(i, j + 1)⟩ ∈ on    for all 1 ≤ i ≤ 2ⁿ, 1 ≤ j < 2ⁿ.

The TILING problem is then defined as follows: given a set {t1, …, tk} of tiles,
compatibility relations on and to, and a number n written in unary notation,
decide whether there exists a tiling f of the square of size 2ⁿ × 2ⁿ with a
distinguished tile type, say t1, at the top left corner (i.e., f(1, 1) = t1).
We describe a reduction that transforms any instance of the tiling problem to
an instance of the containment problem of conjunctive queries under an acyclic
set of cind’s and which requires only polynomial time relative to the size of the
problem instance.

Figure 5.1: Hypertile of size i ≥ 2 (left) and the nine possible overlapping hyper-
tiles of size i − 1 (right).

We define hypertiles as follows. Each composition of 2 × 2 tiles or hypertiles is
a hypertile if the component tiles satisfy the compatibility constraints. Obviously,
all hypertiles are of size 2ⁱ × 2ⁱ for some i ≥ 1 [DV97]. In our encoding, we define
hypertiles of level 1 by the following cind

{⟨x1, x2, x3, x4⟩ | ∃xf : til1(xf, x1, x2, x3, x4, x1)} ⊇
{⟨x1, x2, x3, x4⟩ | to(x1, x2) ∧ to(x3, x4) ∧ on(x1, x3) ∧ on(x2, x4)}

Fortunately, for hypertiles of level i ≥ 2, it is not necessary to enforce that all the
compatibility constraints are satisfied on the level of tiles. Instead, it is sufficient
to verify that all of the nine possible (overlapping) constituent hypertiles of the
next-smaller level i − 1 (see Figure 5.1) satisfy the compatibility constraints. We
define hypertiles of level greater than one by

{⟨xf, yf, zf, uf, t⟩ | ∃f : tili+1(f, xf, yf, zf, uf, t)} ⊇
{⟨xf, yf, zf, uf, t⟩ | ∃ x1, …, x4, y1, …, y4, z1, …, z4, u1, …, u4, d1, …, d13 :
    tili(xf, x1, x2, x3, x4, t) ∧
    tili(yf, y1, y2, y3, y4, d1) ∧
    tili(zf, z1, z2, z3, z4, d2) ∧
    tili(uf, u1, u2, u3, u4, d3) ∧
    tili(d4, x2, y1, x4, y3, d5) ∧
    tili(d6, x4, y3, z2, u1, d7) ∧
    tili(d8, z2, u1, z4, u3, d9) ∧
    tili(d10, x3, x4, z1, z2, d11) ∧
    tili(d12, y3, y4, u1, u2, d13)}

Let bot be a nullary predicate. To complete our encoding, we add cind's
on(ti, tj) ← bot. for each ⟨ti, tj⟩ ∈ on and to(ti, tj) ← bot. for each ⟨ti, tj⟩ ∈ to,
where ti and tj are constants identifying tile types among the k given ones.
Let us consider the encoding shown above as a logic program (that we obtain
by normalizing the cind’s). The existential variables in the subsumer queries of
the tili cind’s will be transformed into function terms aggregating the 4 hypertiles
of the next smaller size. (In fact, also the variables for the top left corner tiles t

will be aggregated in the function terms, but this does not alter the correctness
of the encoding.) The cind for til1 is transformed into the Horn clause

til1 (f1 (x1 , x2 , x3 , x4 ), x1 , x2 , x3 , x4 , x1 ) ←
to(x1 , x2 ), to(x3 , x4 ), on(x1 , x3 ), on(x2 , x4 ).

and the cind’s for tili≥2 are normalized as Horn clauses with heads

tili (fi (x1 , x2 , x3 , x4 , t), x1 , x2 , x3 , x4 , t)

During bottom-up evaluation of such a logic program, the function terms
constructed using fi correspond exactly with the valid hypertiles constructible
from the given k tile types, if the fifth arguments of function terms of symbols
fi≥2 are ignored.
It is quite easy to see that there is a solution for the TILING problem iff the
constraints in our encoding entail

{⟨⟩ | bot} ⊆ {⟨⟩ | tilm(f, x, y, z, u, 1)}

Equally, there is a solution to the TILING problem exactly if the maximally
contained rewriting of {⟨⟩ | tilm(f, x, y, z, u, 1)} in terms of the "source predicate"
bot is nonempty. Thus, these two problems are NEXPTIME-hard. 

Theorem 5.3.15 Let Σ be an acyclic set of cind's and Q and Q′ be conjunctive
queries. Then the containment problem Σ ⊨ Q ⊆ Q′ and the query rewriting
problem for conjunctive queries (under acyclic sets of cind's) are NEXPTIME-
complete. 

Proof. As pointed out in Section 5.3.1, the query containment problem
under an acyclic set of cind’s can be solved by proving the unsatisfiability of the
negation of the containment, which decomposes into a set of ground facts (in
analogy with canonical database of the “freezing trick” of Example 2.2.3) and
a goal. This is a special case of the SUCCESS problem for nonrecursive logic
programs [DV97, VV98].
The problem of deciding whether query rewriting produces a nonempty set
of conjunctive queries can be reduced to the SUCCESS problem by introducing
unit clauses si (x1 , . . . , xni ) ←. (where x1 , . . . , xni are distinct variables) for each
“source” predicate si of arity ni .
As both problems are NEXPTIME-hard by Lemma 5.3.14, NEXPTIME-completeness
follows. 
This result shows that by restricting ourselves to acyclic sets of cind’s we
have nevertheless retained all the expressive power for decision-making (modulo
polynomial transformations) of nonrecursive logic programming.

5.4 Implementation
Our implementation is based on Algorithm 5.3.5, but makes use of several op-
timizations. Every time an MCD m is unfolded with a query to produce an
intermediate rewriting Q, we compute a query Q′ as follows.
Body(Q′) := {Bodyi(Q) | mi ≠ ε}
Head(Q′) := ⟨X̄⟩, where X̄ consists of those variables xi ∈ Vars(Head(Q)) ∩ Vars(Body(Q′))
Q′ is thus created from the new subgoals of the query that have been intro-
duced using the MCD. If Q′ contains non-source predicates, the following check
is performed. We check if our rewriting algorithm produces a nonempty rewriting
on Q′ . This is carried out in depth-first fashion. If the set of cind’s is cyclic, we
use a maximum lookahead distance to assure that the search is finite. If Q′ is not
further rewritable, Q is not processed any further but is dropped.
Subsequently, (intermediate) rewritings produced by unfolding queries with
MiniCon descriptions are simplified using tableau minimization.
Directly after parsing, Horn clauses whose head predicates are unreachable
from the predicates of the query are filtered out. The same is done with clauses
not in the set X computed by
X := ∅;
do
    X := X ∪ {c ∈ C | Preds(Body(c)) ⊆ Sources ∪ {Pred(Head(c′)) | c′ ∈ X}};
while X changed;
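The fixpoint above is a standard reachability computation; a minimal sketch (assumed clause representation, not CindRew's code) is the following.

```python
# Keep only clauses whose body predicates are, directly or transitively,
# derivable from the source predicates.

def filter_clauses(clauses, sources):
    """clauses: list of (head_pred, set_of_body_preds)."""
    derivable = set(sources)
    kept = set()
    changed = True
    while changed:
        changed = False
        for i, (head, body) in enumerate(clauses):
            if i not in kept and set(body) <= derivable:
                kept.add(i)
                derivable.add(head)
                changed = True
    return [clauses[i] for i in sorted(kept)]

clauses = [('v', {'s1'}), ('p', {'v', 's2'}), ('q', {'r'})]   # r is not derivable
print(filter_clauses(clauses, {'s1', 's2'}))   # keeps the first two clauses only
```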
We have implemented the simple optimizations known from the Bucket Algo-
rithm [LRO96] and the Inverse Rules Algorithm [GKD97] for answering queries
using views which are used to reduce the branching factor in the search process.
Beyond that, MiniCon descriptions are computed with an intelligent backtrack-
ing method that always chooses to cover subgoals first for which this can be done
deterministically (i.e., the number of Horn clauses that are candidates for un-
folding with a particular subgoal can be reduced to one), thereby reducing the
amount of branching.
Our unification algorithm allows us to pre-specify variables that may in no case
be unified with a function term (e.g., for head variables of queries or for atoms
already over source predicates). This allows us to detect as early as possible
that no function-free rewriting can be created.
In the implementation of the deterministic component of our algorithm for
generating MiniCon descriptions, we first check whether the corresponding pairs
of terms of two atoms to be matched unify independently before doing full
unification. This detects most violations with very low overhead. Given an
appropriate implementation, it is possible to check this property in logarithmic
or even constant time.

An important performance issue in Algorithm 5.3.5 is that MCDs are applied only
one at a time, which leads to redundant rewritings – e.g., the same MCDs may be
applicable in different orders (as in the classical problem of answering queries
using views, a special case) – and thus to a search space that may be larger than
necessary. We use dependency-graph-based optimizations to check whether a denser
packing of MCDs is possible. For the experiments with layered sets of cind's
reported on in Section 5.5 (Figures 5.3 and 5.4), MCDs are packed exactly as
densely¹¹ as in the MiniCon algorithm of [PL00].

Distribution
The implementation of our query rewriter (with algorithms for both semantics
presented) consists of about 9000 lines of C++ code. Binaries for several
platforms, as well as examples and a Web demonstrator that allows limited-size
problems to be run online, are made available on the Web at

http://cern.ch/chkoch/cindrew/

5.5 Experiments
A number of experiments have been carried out to evaluate the scalability of our
implementation. These were executed on a 600 MHz dual Pentium III machine
running Linux. A benchmark generator was implemented that randomly gener-
ated example queries and sets of cind’s. This program created chain as well as
random queries (and cind’s).
In all experiments, the queries had 10 subgoals, and we averaged timings over
50 runs. Sets of cind's were always acyclic. This was ensured by the use of
predicate indexes such that the predicates in a subsumer query of a cind only used
indexes greater than or equal to a random number determined for each cind, while
subsumed queries only used indexes smaller than that number. Times for parsing
the input were excluded from the diagrams, and redundant rewritings were not
eliminated¹². The diagrams relate reasoning times on the (logarithmic-scale)
vertical axis to the problem size as the number of cind's on the horizontal axis.

5.5.1 Chain Queries
Chain queries are conjunctive queries of the form

q(x1, xn+1) ← p1(x1, x2), p2(x2, x3), …, pn−1(xn−1, xn), pn(xn, xn+1).
¹¹ See Section 3.6.2.
¹² Note that CindRew can optionally make rewritings nonredundant and minimal. However, for these experiments, these options were not active.

Figure 5.2: Experiments with chain queries and nonlayered chain cind’s.

Thus, chain queries are constructed by connecting binary predicates via vari-
ables, as shown above. In our experiments, the distinguished (head) variables
were the first and the last. The chain cind’s had between 3 and 6 subgoals in
both the subsuming and the subsumed queries.
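For illustration, a benchmark chain query of this shape could be generated as follows (a hypothetical sketch; the thesis' actual generator is not shown).

```python
# Sketch of generating a random chain query over binary predicates p1..pk,
# chained via shared variables, with head variables x1 and x_{n+1}.
import random

def random_chain_query(n_subgoals, n_predicates, rng=random):
    body = []
    for i in range(1, n_subgoals + 1):
        pred = f"p{rng.randint(1, n_predicates)}"
        body.append(f"{pred}(x{i}, x{i + 1})")
    return f"q(x1, x{n_subgoals + 1}) :- " + ", ".join(body) + "."

print(random_chain_query(10, 16))   # e.g. 10 subgoals over 16 predicates
```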
We report on three experiments with chain queries.
The first diagram (Figure 5.2) shows timings for chain queries. The steep line on
the left represents an alternative query rewriting algorithm that we have
implemented which follows the classical semantics and a traditional resolution
strategy, unfolding certain clauses where this is deemed appropriate, as described
in Algorithm 5.3.3. This is particularly effective with acyclic sets of constraints
that are densely packed, as is the case here. The experiment reported on here was
carried out with 16 predicates. This algorithm is compared to, and clearly
outperformed by, CindRew (with three different numbers of predicates: 8, 12, and
16). Since the constraints become sparser as more predicates are available, a
larger number of predicates renders the query rewriting process simpler.
In the second diagram (Figure 5.3), we report on CindRew's execution times
with cind's that were generated with an implicit layering of predicates (with
2 layers). This experiment is in principle very similar to local-as-view rewriting
with p/2 global predicates and p/2 source predicates (where the subsumer queries
of cind's correspond to logical views in the problem of answering queries using
views), followed by simple view unfolding to account for the subsumed queries of
cind's. We again report timings for three different numbers of predicates.
In the third diagram (Figure 5.4), the same problem of finding a maximally contained

Figure 5.3: Experiments with chain queries and two layers of chain cind’s.

Figure 5.4: Experiments with chain queries and five layers of chain cind’s.

Figure 5.5: Experiment with random queries.

rewriting is solved for 20 and 40 predicates, which are grouped into a stack of
five layers of 4 and 8 predicates each, respectively. Of the five sets of predicates,
one constitutes the sources and one the "schema" over which queries are asked,
and four equally sized sets of cind's bridge between these layers¹³.
As can be seen by comparing the second and third diagrams with the first, the
hardness of the layered problems is more homogeneous. Particularly in Figure 5.2
and Figure 5.3, one can also observe subexponential performance. Note that in
the experiment of Figure 5.4, timings were taken in steps of 20 cind’s, while in
the other experiments, this step length was 100.

5.5.2 Random Queries
The random cind's had either three or four subgoals in the subsumer and two
subgoals in the subsumed query. Predicates had arity two or three, and the number
of distinguished variables was either one or two. The number of existentially
quantified variables was two to three times as high, in order to reduce the number
of correct solutions. For the experiments carried out with random queries, the
number of solution rewritings quickly became very large, so we report on computing
at most 100 rewritings¹⁴. Figure 5.5 shows the timings for random queries as described
¹³ See Section 5.2 for our definition of layered sets of cind's.
¹⁴ In the runs with chain queries (and constraints), we of course computed all rewritings.

earlier, with five predicates and five layers (i.e., one predicate per layer). We
report only on the case with five layers because with no or fewer layers, computing
the first 100 solutions was too easy.

5.6 Discussion
This chapter has addressed the query rewriting problem in data integration from
a fresh perspective. Expressive symmetric constraints are used, which we have
called Conjunctive Inclusion Dependencies. The problem of computing the max-
imally contained rewritings was studied under two justifiable semantics. We have
discussed their main theoretical properties and have shown that they coincide.
We have presented the second semantics, motivated by rewrite systems, as a valuable
alternative to the classical logical semantics. This semantics allows us to apply
time-tested (e.g., tableau minimization) as well as more recent (e.g., the MiniCon
algorithm) techniques and algorithms from the database field to the query rewriting
problem.
There are several advantages of algorithms following the philosophy of the
rewrite systems semantics for query rewriting. Under this semantics, intermedi-
ate results are (function-free) queries and can be immediately made subject to
query optimization techniques known in the database field. As a consequence,
further query rewriting may start from simpler queries, leading to an increase in
performance and fewer redundant results that later have to be found and elimi-
nated. Thus, it is often possible to detect dead ends early. As a trade-off (as can
be seen in Algorithm 5.3.5), an additional degree of nondeterminism is introduced
compared to resolution-based algorithms under the classical semantics.
In the context of data integration, there are usually a number of regulari-
ties in the way constraints are implemented and queries are posed. Usually we
expect to have a number of schemata, each containing a number of predicates.
Between the predicates of one schema, no constraints for data integration uses
are defined. Moreover, we expect inter-schema constraints usually to be of the
form Q1 ⊆ Q2 where most (or all) predicates in Q1 belong to one and the same
schema, while the predicates of Q2 belong to another. Queries issued against the
system are usually formulated in terms of a single schema. Given these assump-
tions, we suspect algorithms following the rewrite systems semantics which apply
optimization techniques from the database area on intermediate results to have a
performance advantage over classical resolution-based algorithms, which do not
exploit such layering heuristics.
Clearly, the noncomputability of rewritings in general is an important problem,
which will be addressed in the following chapter. In particular, we will argue
that one can often avoid having cyclic definitions of cind's in a system for
finding maximally contained rewritings¹⁵.
¹⁵ Note that it is only reasonable to talk about equivalent rewritings if cind's are allowed to be cyclic.
Chapter 6

Model Management

In the previous chapter, a detailed presentation of the query rewriting problem
with conjunctive inclusion dependencies has been given. Such inter-schema con-
straints are not only highly relevant to data integration because of their ability
to deal with concept mismatch that requires symmetric constraints (see Exam-
ple 1.3.1). This class of constraints also supports the construction of mappings
that are robust with respect to change.
This chapter starts with the definition of a very simple model for managing
relational schemata and mappings based on cind’s in a repository. Schemata are
simply sets of relations, without any additional semantics such as those commonly
encoded as functional dependencies or inclusion dependencies (e.g., unique keys
and foreign key constraints). This restriction, together with our decision to
confine the presentation to a purely relational rather than a semantic or
object-oriented data model, makes this study a mostly theoretical one. However,
it allows us to concentrate on the main issues of supporting maintainability in a
concise way. Furthermore, such extensions are reasonably simple to realize¹.
In the second section of this chapter we discuss the problem of designing mappings
that are robust and do not easily become useless when the data integration
requirements change. Finally, we attack the problem of arriving at
collections of inter-schema constraints that are acyclic, a property that Chapter 5
has shown to be desirable.

6.1 Model Management Repositories
A model management repository is a pair ⟨R, M⟩ of a set of relational schemata
R and a set of mappings M. For simplicity, we consider schemata as simple sets
of relation schemata without dependencies (which could of course be added to
the mechanism). Each relational predicate
¹ See Section 2.5 on the issue of applying our work on query rewriting to object-oriented queries, and Chapter 7 on the generalization of the query rewriting problem.


1. Add an empty schema R = ∅ to R.

2. Copy the predicates of a schema R ∈ R into a new schema R′ . Mappings
of R are not linked to R′.

3. Add a predicate to schema R. Predicates are identified by name within a
schema and have a unique fixed arity.

4. Rename a predicate p of a schema R, as well as all of its occurrences in
mappings.

5. Change the arity of a predicate p in a schema. (a) To add an attribute
to p at position i, each of its appearances in cind’s is augmented with a
new existential variable for that attribute. (b) An attribute may only be
removed from p if for every appearance of p in a cind, there is a variable
in this attribute position which is existentially quantified and not used in
a join.

6. Delete a predicate p from a schema. This requires that p is not used in any
mapping.

7. Delete a schema. This is only allowed if no predicate from the schema is
used in any mapping.

8. Import a schema (from DDL, IDL, and DTD files, relational databases,
spreadsheet layouts, . . .)

Figure 6.1: Operations on schemata.

1. Add an elementary mapping Σ to M.

2. Add a cind Q1 ⊇ Q2 to an elementary mapping Σ from schemata R1 , . . . , Rn
to schema R. Q1 must be over predicates in R and Q2 over predicates in
R1 , . . . , Rn .

3. Remove a cind from an elementary mapping Σ.

4. Delete a mapping. In case of a composite mapping, all of its constituents
are deleted, including auxiliary schemata.

Figure 6.2: Operations on mappings.

1. UNFOLDM (R, ΣGAV)
Rewrite a schema R using a set of GAV views ΣGAV to achieve a finer
granularity of entities contained. For each view

p(X̄) ← p1 (X̄1 ), . . . , pn (X̄n )

in ΣGAV , let p be a predicate in R and p1 , . . . , pn be new predicate names.
p is replaced in R by {p1 , . . . , pn } and all subsumer or subsumee queries of
cind’s in M that contain p are unfolded with the GAV view.

2. MERGEM (R1 , R2, R′ )
Merge two schemata R1 and R2 into a new schema R′ . This can be done if
R1 and R2 do not contain predicates of the same name but with different
arities and if there are no dependencies (via mappings) between any of the
predicates in R1 and R2 . Predicates from R1 and R2 with the same names
fall together. All predicates from R1 and R2 occurring in mappings in M
are replaced by the corresponding predicates from R′ .

3. SPLITR,M(R, {R1, . . . , Rm}, {Rm+1, . . . , Rn}, R′)
Distribute the role of a schema R in the data integration infrastruc-
ture across R and a new schema R′. Let {R1, . . . , Rm}, {Rm+1, . . . , Rn}
be a partition of all schemata against which R is mapped in M (i.e.,
{R1, . . . , Rm, Rm+1, . . . , Rn} is the set of schemata {X ∈ R | ∃M ∈ M :
R ∈ from(M) and to(M) = X}). Copy R to a new schema R′. Copy
all the mappings M with to(M) = R and replace all occurrences of predi-
cates in R with their copies in R′. For all mappings in M against the schemata
Rm+1, . . . , Rn, replace the predicates from R by their copies in R′.
This operation is close to being the inverse of the previous merge operation.

4. Eliminate an auxiliary schema R by unfolding the mappings from R with
the mappings against R, if all the constraints thereby created are cind’s.
This condition is guaranteed to hold if all mappings are GAV.

5. COMPOSEM(A)
Create a composition of (existing) mappings around a (now auxiliary)
schema A, as described in the definition of composite mappings.

6. Ungroup a composite mapping. This is needed when an auxiliary schema
has matured and is to be (re-)used outside the mapping.
Figure 6.3: Complex model management operations.

Each relational predicate is either marked “source” or “logical”. A schema is
called purely logical if it does not contain source predicates. Relational attributes
may be named and typed. If they are unnamed, we refer to them by their index.
Relational predicates are unique across all schemata – they are identified by their
schema id in combination with their predicate name.
A mapping M maps from a set of schemata R1, . . . , Rn ∈ R (denoted
from(M) = {R1, . . . , Rn}) against a single schema R (denoted to(M) = R).
We require that to(M) ∉ from(M).
Mappings are either elementary or composite. An elementary mapping Σ
is a set of cind’s where the subsumer sides of the constraints only use logical
predicates from R and the subsumed sides only use predicates from R1 ∪ . . . ∪ Rn.
The dependency graph of an elementary mapping thus has a diameter of one. A
composite mapping M can be created from a schema A, a mapping M0, and a
set of mappings {M1, . . . , Mn} if

1. A ∈ from(M0),

2. A = to(Mi) for each 1 ≤ i ≤ n,

3. A is purely logical, and

4. A is not used in any other mapping in M besides M0, . . . , Mn.

A is called an auxiliary schema. We require the cind’s in the union of all
elementary mappings of a composite mapping to be acyclic.
We do not provide an exhaustive list of all conceivable model management
operations. Figure 6.1 and Figure 6.2 list operations for the manipulation of
schemata and mappings, respectively. Figure 6.3 shows some of the more inter-
esting complex operations. Model management software can substantially sup-
port a human expert in these manipulation tasks. For instance, when a new
attribute is added to a relation, all of its occurrences in cind’s can be
automatically expanded with a new existentially quantified variable.
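To make these notions more tangible, the following sketch shows how a repository and a few of the operations of Figures 6.1 and 6.2 might be represented in a prototype. It is an illustration under simplifying assumptions (schemata as predicate name/arity maps, cind's as pairs of atom lists), not a description of an existing implementation, and all identifiers are chosen for this sketch only.

    # Illustrative sketch only: schemata are maps from predicate names to
    # (arity, kind) with kind either "source" or "logical"; a cind is a pair
    # (subsumer_atoms, subsumed_atoms), an atom a triple (schema, predicate, args).
    class Repository:
        def __init__(self):
            self.schemas = {}        # R: schema id -> {pred name: (arity, kind)}
            self.mappings = []       # M

        def add_schema(self, sid):                        # Figure 6.1, operation 1
            self.schemas[sid] = {}

        def add_predicate(self, sid, name, arity, kind):  # Figure 6.1, operation 3
            assert name not in self.schemas[sid]
            self.schemas[sid][name] = (arity, kind)

    class ElementaryMapping:
        def __init__(self, repo, from_schemas, to_schema):
            assert to_schema not in from_schemas          # to(M) not in from(M)
            self.repo = repo
            self.from_schemas, self.to_schema = set(from_schemas), to_schema
            self.cinds = []
            repo.mappings.append(self)                    # Figure 6.2, operation 1

        def add_cind(self, subsumer_atoms, subsumed_atoms):   # Figure 6.2, operation 2
            for sid, pred, args in subsumer_atoms:
                arity, kind = self.repo.schemas[sid][pred]
                # subsumer side: only logical predicates of the target schema
                assert sid == self.to_schema and kind == "logical" and len(args) == arity
            for sid, pred, args in subsumed_atoms:
                arity, _kind = self.repo.schemas[sid][pred]
                assert sid in self.from_schemas and len(args) == arity
            self.cinds.append((subsumer_atoms, subsumed_atoms))

Operations such as renaming a predicate or changing its arity would, in the same spirit, walk over all cind's of all mappings and adjust the affected atoms – exactly the kind of mechanical work that a model management tool can take off the expert's hands.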

6.2 Managing the Change of Schemata and Requirements
The lack of a global schema against which sources can be integrated leads to
the problem that the number of mappings may grow with the square of the
number of schemata. Together with the prospect that schemata may evolve, this
leads to a serious management and maintenance problem.
We approach this problem using two principal techniques. These are the de-
coupling of dependencies between mappings with respect to change (Section 6.2.1)
and the merging and clustering of design artifacts of the data integration archi-
tecture wherever possible to reduce redundancy and the number of such artifacts
to be managed (Section 6.2.2).

6.2.1 Decoupling Mappings
Given a number of schemata and mappings expressing dependencies between
them, there is a risk that some minor modification renders the complete redesign
of a mapping (which may be complex and work-intensive to design) necessary.
Similarly, a change of a schema or of the data integration requirements regarding
a schema may invalidate several mappings.
Two goals are immediate consequences of this:

• Firstly, changes to a set of views should remain as local as possible. When-
ever a source is added, we only want to add a single (or a few) logical views,
without having to carry out a major redesign of mappings. It has
been observed that local-as-view integration supports the simple addition
and removal of source mappings (see Section 3.9).

• Secondly, mappings should decouple source and integration schemata from
each other in the sense that a change of an integration schema (i.e.,
schema evolution) or of its data integration requirements has only a minor
impact on the “other end” of a mapping, the part of the description that is
responsible for integrating sources.

Composite mappings as defined in the previous section permit the design of
layers of inter-schema constraints that may be attributed different roles. The
high expressiveness of our query rewriting formalism allows such layers, for
instance, to be either sets of LAV or of GAV views. The resulting design potential
enables us to create mappings that make intuitions regarding likely future change
explicit and to prepare for this change. We can attribute dedicated integration
roles to individual layers, as shown in the following example.

Example 6.2.1 Let there be a fixed integration schema R with a single relation
R.book, against which we would like to integrate four sources S1, S2, S3, S4 with
five source relations S1.book, S2.book, S3.book, S4.sales and S4.categories.
We define a composite mapping between R and its sources that consists of
three layers (created using operation 5 of Figure 6.3 twice) and two auxiliary
schemata, A1 and A2. We use three auxiliary predicates, A1.book, A1.sales, and
A2.s′4. The outermost layer, a GAV mapping from S4 to A2, takes over the task of
pre-filtering sources. The middle mapping from A2 to A1 follows the local-as-
view approach and takes over the main source integration role. The innermost
mapping, again GAV, projects from our well-designed auxiliary schema A1 to R.
Consider the following constraints. (See also Figure 6.4.)

Figure 6.4: Data integration infrastructure of Example 6.2.1. Schemata are
visualized as circles and elementary mappings as arrows.

M3: “Pre-filtering” (GAV); from(M3) = {S4}, to(M3) = A2

A2.s′4(Name, Producer, Price, Sales, Units) ←
    S4.sales(CategoryId, Name, Producer, Price, Sales, Units),
    S4.categories(CategoryId, ”Books”).

M2: “Source integration” (LAV); from(M2) = {S1, S2, S3, A2}, to(M2) = A1

{⟨Isbn, Name, Author⟩ | S1.book(Isbn, Name, Author)} ⊆
{⟨Isbn, Name, Author⟩ | ∃Price, Publisher :
    A1.book(Isbn, Name, Author, Price, Publisher)}

{⟨Isbn, Name, Publisher⟩ | S2.book(Isbn, Name, Publisher)} ⊆
{⟨Isbn, Name, Publisher⟩ | ∃Author, Price :
    A1.book(Isbn, Name, Author, Price, Publisher)}

{⟨Name, Author, Sales, Units⟩ | S3.book(Name, Author, Sales, Units)} ⊆
{⟨Name, Author, Sales, Units⟩ | ∃Isbn, Price, Publisher :
    A1.book(Isbn, Name, Author, Price, Publisher),
    A1.sales(Isbn, Sales, Units)}

{⟨Name, Publisher, Price, Sales, Units⟩ |
    A2.s′4(Name, Publisher, Price, Sales, Units)} ⊆
{⟨Name, Publisher, Price, Sales, Units⟩ | ∃Isbn, Author :
    A1.book(Isbn, Name, Author, Price, Publisher),
    A1.sales(Isbn, Sales, Units)}

M1: “Customizing” (GAV); from(M1) = {A1}, to(M1) = R

R.book(Name, Author, Price, Publisher) ←
    A1.book(Isbn, Name, Author, Price, Publisher).

We have created the GAV view A2.s′4 assuming that CategoryId is only used
in that source, and have anticipated that no other future sources will provide it,
making it easier to leave the schema against which the LAV views are mapped
unchanged. On the other hand, ISBN codes are or will be provided by several
sources and are relevant to integration, although our legacy integration schema
does not know them. As a consequence, we have created an auxiliary integra-
tion schema, and provide a GAV mapping between the auxiliary and the legacy
integration schema. We have also added a “sales” predicate to it, assuming that
many sources will provide sales information and our action will save us from
creating many GAV views that project these attributes out. 

Example 6.2.1 has used a three-layer (GAV-LAV-GAV) integration strategy,
where the dedicated roles of (1) customizing, (2) source integration and (3) pre-filtering
were assigned to the three layers M1, M2, and M3. The LAV layer M2 carries
out most of the source integration. If sources have to be integrated
against an information system for which the schema lacks properties necessary for
LAV integration, the LAV layer integrates against an auxiliary schema (schema
A1 in Example 6.2.1) that extends the integration schema by these properties.
The first (GAV) layer M1 maps the predicates of the auxiliary schema against
the (legacy) integration schema. The third layer M3 may be used to filter out
data or project out attributes that are irrelevant for the integration purpose at
hand, such that the (auxiliary) integration schema and with it the LAV views do
not have to be changed more often than absolutely necessary.
Intuitively, this strategy should allow for convenient and maintainable data
integration in a large number of scenarios. The LAV layer provides locality
of change when sources are added (or deleted), and the entirety of these three
layers facilitates decoupling when an integration schema changes. Changes to
an integration schema can often be absorbed by the pre-filtering GAV views
of mappings from such a schema and the customizing GAV views of mappings
against such a schema. Thus, changes to the data integration infrastructure
usually remain local and reasonably simple to manage.
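As a hypothetical continuation of Example 6.2.1 (not part of the example itself), suppose the legacy relation R.book were redefined to drop the Author attribute. Only the customizing GAV view of M1 would have to be adapted, e.g. to

R.book(Name, Price, Publisher) ← A1.book(Isbn, Name, Author, Price, Publisher).

while the LAV layer M2, the pre-filtering layer M3, and the auxiliary schema A1 would remain untouched.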

Adding Sources
This motivates the following steps for adding sources2 (see Figure 6.5 for the
development stages of the set of views of a given legacy integration schema):

• Initially, we attempt to use LAV to integrate the sources against the inte-
gration schema.
2 Of course, the rules given here should be followed less strictly if the designer of mappings
anticipates some future change and designs a more sophisticated auxiliary integration schema
that deviates more from the legacy integration schema.

Figure 6.5: The lifecycle of the mappings of a legacy integration schema.

• If there are source attributes that do not exist in the integration schema,3
make a choice depending on whether these source attributes are likely to
occur in many other sources or not. If the answer is yes, copy the predicates4
to which they should most naturally be added, and add the attributes. Use
the altered auxiliary schema for LAV while at the same time providing GAV
views from the altered predicates to the original versions in the integration
schema, essentially just projecting out the added attributes. This is a
nonlocal change. However, all the logical views which have existed before
and use changed predicates can be altered automatically (a simple
dummy attribute has to be introduced at the right position). Otherwise,
add a GAV view before the LAV stage that projects out these attributes.
Auxiliary schemata for LAV integration can also be generalized using the
UNFOLD operator of Figure 6.3.

• If some pre-filtering of data available through sources (see Example 6.2.1)
is needed, decide whether the predicates of other future sources are likely
to be more general, in similar ways, than the current schema against which
LAV integration is carried out. If so, generalize the auxiliary integration
schema (if LAV integration is carried out against the legacy schema, copy it
first) and provide proper GAV views. Otherwise, add a GAV view between
a source and the LAV views.

Of course, there is a varying degree of intuition that can be put into auxiliary
integration schemata in order to facilitate future maintenance. On the parsimo-
nious side, auxiliary integration schemata are only changed when this is really
needed. At the other extreme, we may attempt to design a kind of “global”
integration schema that combines the source integration of several similar
information systems subscribing to similar sources. This is discussed in more
detail in the following section.
3 For instance, this is the case for the Isbn attribute of several sources in Example 6.2.1.
4 That is, create an auxiliary integration schema that is equal to the integration schema
apart from a number of predicates that are adapted to be able to map the sources in question.

Figure 6.6: Merging auxiliary integration schemata to improve maintenance.


6.2.2 Merging Schemata
The second main technique for simplifying the management of schemata and map-
pings in our architecture is based on the attempt to merge (auxiliary) schemata
in the tradition of [BLN86, KDB98] (using the MERGE operation of Figure 6.3)
whenever possible, or even to develop global schemata of limited scope5 that are
well designed and prepared for kinds of future change that are likely to occur.

Reusing Auxiliary Schemata
It may be reasonable to use the predicates of an auxiliary integration schema of
an information system, rather than its legacy schema, as sources for yet another
information system. This is particularly appropriate if the intuitively perceived
quality of the former is much higher than that of the latter. Another reason
may be that the GAV views mapping the auxiliary integration schema against
the legacy integration schema filter out relevant data.
This leads us to the possibility of reusing auxiliary integration schemata,
which may eliminate redundant work and greatly simplify the maintenance task.
Such a step may be justified if several information systems have similar integra-
tion requirements (need similar information from sources) and if the adjustments
that will be needed are expected to correlate heavily when it comes to change
of sources. If this is the case, auxiliary integration schemata can be merged into
one. The schema merging task can for example be carried out by defining a suit-
able “more global” auxiliary schema for the given auxiliary integration schemata,
defining appropriate GAV views to map the predicates of the old schemata against
the new one, and then generalizing these schemata and their mappings by un-
folding (by using the UNFOLD operation of Figure 6.3).
5 These are similar to export schemata in federated databases [SL90].

Figure 6.7: A clustered auxiliary schema. Schemata are displayed as circles and
mappings as arrows.

Clustering
For instance, consider again the case of the LHC project (see Section 1.3). There
are groups of information systems that, although they are based on different
schemata, satisfy similar needs (are in the same stage of the project lifecycle)
for different subprojects. For such clusters, it may be wise to create a “global”
information system or data warehouse (from which the individual information
systems basically receive their data through a simple GAV mapping) whose aim
is restricted to that particular step of the lifecycle (as noted, building a global
schema for the whole lifecycle may not be possible), and which concentrates
source integration against its global schema.
Figure 6.7 depicts such a shared auxiliary schema. Even if data integration
is carried out on demand (i.e. using the “lazy approach” to data integration
[Wid96]), one can think of such an approach as an analogy to data warehouses
(the clustered schemata) and data marts (the individual integration schemata).
The SPLIT operation of Figure 6.3 makes it possible to revert clustering decisions
if integration schemata making use of such “global” schemata evolve in different
ways and the clusters become unsustainable.
The creation of a “global” auxiliary integration schema for several similar
information systems also simplifies the task of avoiding circularities in definitions
of constraints caused by information systems mutually using each other’s virtual
predicates.

6.3 Managing the Acyclicity of Constraints
It is clearly a goal to have the set of all cind’s in a data integration system be
acyclic, as that property guarantees the computability of rewritings. Cyclic sets
of cind’s amount to a self-referential definition of the source-to-integration predicate
relationships. Rewritings produced using the results of Chapter 5 may in theory
be of infinite size.
One could give up the completeness requirement and produce rewritings
that are guaranteed to be sound but may be incomplete, simply by setting a
threshold on processing time or on the number of constraints used. Our intuition is
that in practice, when real-world constraints for data integration are encoded, the
rewriting process will terminate with a complete result in most cases. Alterna-
tively, the query rewriting tool could, given a query and with some justification,
cut away e.g. those cind’s whose directed edges in the dependency graph are most
distant from the predicates in the query and which occur in a cycle.
If the process of designing mappings between schemata is computer-supported,
a system can help to avoid such situations. Acyclicity can be enforced auto-
matically throughout the design process of mappings and should not be perceived
as too restrictive in that case.
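As a sketch of how such a tool might enforce this (illustrative only: each cind is abstracted to the set of edges it contributes to the dependency graph, here oriented from the predicates of its subsumed query to those of its subsumer query – the orientation is immaterial for cycle detection), a new cind would simply be rejected whenever it closes a directed cycle:

    from collections import defaultdict

    def has_cycle(edges):
        """Directed-cycle test by depth-first search with white/grey/black colouring."""
        graph = defaultdict(set)
        for u, v in edges:
            graph[u].add(v)
        WHITE, GREY, BLACK = 0, 1, 2
        colour = defaultdict(int)

        def visit(u):
            colour[u] = GREY
            for v in graph[u]:
                if colour[v] == GREY or (colour[v] == WHITE and visit(v)):
                    return True                      # back edge found: cycle
            colour[u] = BLACK
            return False

        return any(colour[u] == WHITE and visit(u) for u in list(graph))

    def may_add_cind(existing_edges, new_cind_edges):
        """Accept a new cind only if the dependency graph stays acyclic."""
        return not has_cycle(list(existing_edges) + list(new_cind_edges))

    # For the edges of Example 6.2.1, e.g. ("S1.book", "A1.book") from M2 and
    # ("A1.book", "R.book") from M1, any new cind contributing the edge
    # ("R.book", "S1.book") would be rejected.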
The clustering of auxiliary schemata that combine logical predicates repre-
senting integrated sources and that are to be connected to several “subscriber”
schemata clearly supports the goal of avoiding cyclicity. In the extreme case, one
could aim at defining auxiliary schemata that are commonly used by all informa-
tion systems requiring access to certain resources, while making sure that none
of the mappings against these resources share any of the logical predicates used
in the earlier mappings.
Chapter 7

Outlook

This chapter first presents the problem of providing physical data independence
under schema evolution in Section 7.1. This is another realistic application of
query rewriting with cind’s, outside of data integration. It generalizes the classical
problem of maintaining physical data independence in the same way that data
integration by query rewriting with cind’s generalizes data integration via the
problem of answering queries using views.
In the remainder of the chapter, we discuss extensions of query rewriting
with cind’s (which has so far only been considered in the context of relational
conjunctive queries) that are analogous to those that have been proposed for the
problem of answering queries using views. A few issues worth considering are
• Recursive queries. We address the query rewriting problem with recur-
sive (datalog) queries and nonrecursive sets of cind’s in Section 7.2. This
problem can be solved easily as a generalization of the work in [DG97].
• Sources with binding patterns within the data integration architecture pre-
sented in Chapter 4 are relevant for two reasons. Firstly, this feature may
be required for the integration of sources with restricted query interfaces,
such as legacy systems.
Secondly, it allows the inclusion of procedural code for transforming data. This
may provide a gateway to different approaches to data integration that may
coexist in a heterogeneous data integration infrastructure. Another appli-
cation may be procedures that implement complex data transformations.
Of course, it has been observed that most practical database queries are
of a very simple nature, and that very restricted query languages (with their
favorable theoretical properties) cover most practical needs, particularly of
non-expert users. This, however, does not always remain true. Certain
classes of queries that are needed in the real world (particularly in engi-
neering environments such as in our use case of Section 1.3) are sufficiently
hard that they cannot be carried out using the query language supported by the
data integration platform and the underlying reasoning method.


Figure 7.1: A cind as an inter-schema constraint (A) compared to a data transfor-
mation procedure (B). Horizontal lines depict schemata and small circles depict
schema entities. Mappings are shown as thin arrows.

The solution to this is to encapsulate advanced data transformations in a
“procedure”, that is, a construct that, for the purposes of data integration
and query rewriting, is only described externally, by its interface. The
procedure itself may contain a query in a highly expressive query language
or a piece of code in a high-level programming language.
The tradeoff made is the following: query rewriting reasoning is simplified
(and often only made possible in the first place), and certain complicated queries
may be hard-wired in efficient, problem-specific code. On the downside, when
procedures are used, the completeness of rewriting is lost compared to queries
that are fully, rather than just externally, described.
If such data transformation procedures are embedded in the data integra-
tion architecture in the sense that they read out (possibly integrated) data
from information systems that are inside the infrastructure as well, one
may describe constraints that hold between interfaces and schemata of ac-
cessed data (see Figure 7.1) using e.g. a description logics formalism such
as in [BD99]. Constraints of this kind could be used to bound the query
rewriting process and eliminate irrelevant rewritings. Such a hybrid ap-
proach of query rewriting and description logics reasoning would be highly
interesting, though necessarily incomplete.
The query rewriting problem with binding patterns in the case of acyclic
sets of cind’s can be reduced to the problem addressed in [DGL00] by the
transformation described in Section 7.2.

• Object-oriented and semistructured schemata and queries. We have dis-
cussed the equivalence of (the range-restricted versions of) nested relation
calculus and relational calculus in Section 2.5. Given this, the rewriting
of conjunctive nested relation calculus queries and analogous constraints
can be simulated by the relational case by a simple syntactic transforma-
tion (see e.g. Example 2.5.1). This covers a practically relevant class of
queries in the complex object model that can be mapped straightforwardly
to object-oriented data models (see also [LS97]).
Semistructured data models (e.g. OEM [AQM+ 97] or ACeDB [BDHS96])
have recently received much interest due to the vision of considering the
World Wide Web as a single large database [AV97a, FLM98], and the rise of
XML-related technologies as a major standard for data exchange [ABS00].
The semistructured case can to a certain extent be seen as a special case of
the object-oriented one. However, a special case of recursive queries – regular
path queries – is an important aspect of semistructured database queries
[CM90, Abi97, AV97b]. We address the rewriting of recursive queries under
cind’s in Section 7.2, as mentioned. For local-as-view integration in the
semistructured context, particularly with regular path views, see [PV99,
CDLV99, CDLV00b, CDLV00a].

• Conjunctive queries with inequalities. Although practically relevant, this
issue is left open for future research. A special case is discussed in Footnote 3
in Section 7.1.
Query rewriting with cind’s and functional dependencies is another topic
of future research.

7.1 Physical Data Independence under Schema Evolution
7.1.1 The Classical Problem
Database systems are based on the assumption of a separation between a logical
schema and a physical storage layout, which is an important factor in their
popularity. In practice, however, this independence between the logical and the
physical schema is not really given in state-of-the-art database systems. This is
at least true for relational database systems, where relations are usually stored
as files that are quite straightforward serializations of the data under the
logical schema. For object-oriented schemata, the physical and logical schemata
in practice do not coincide that closely; otherwise, there would be too much
redundancy. Still, there is usually a fixed canonical relationship between physical
and logical schemata.
True physical data independence would be worthwhile, as it would allow one to
define a logical schema according to design and application requirements and
a physical schema optimized for performance.
Figure 7.2: EER diagram of the university domain (initial version).

Currently, the coupling between physical and logical schemata does not permit
this, forcing one to depart from schemata that follow domain conceptualizations
in order to attain satisfactory performance.
Work on improving this situation (in particular, GMAP [TSI94]) has defined
physical storage structures as materialized views over the logical schema. That
way, answering queries requires local-as-view query rewriting (which is not harder
than NP-complete in the size of the query [LMSS95]), and the database update
problem is comparatively simple (it is the view maintenance problem [AHV95],
concerned with propagating changes to base tables incrementally to views, such
that views do not need to be fully refreshed whenever a change occurs). This task
would be substantially more complicated if the relationship between the logical
schema and the physical storage structures were defined the other way round,
i.e., the logical relations as views over the physical ones. In that case, the view
update problem [BS81, FC85] would have to be solved. The approach of [TSI94]
also makes it possible to improve performance for classes of similar queries that
are often asked, simply by adding further storage structures that are defined as
views similar to those queries.

Example 7.1.1 We use the popular university domain that has previously been
used to communicate the essentials of the maintenance of physical data indepen-
dence [TSI94, Lev00]. Consider the logical schema of Figure 7.2, presented as an
Extended Entity Relationship (EER) diagram [TYF86, Che76], i.e., with is-a
relationships, which are drawn as arrows with white triangular heads. It translates
into the following relational schema. Primary key attributes are underlined.

v1.student(StudId, Name)
v1.masters_student(StudId, SecondPeriod)
v1.phd_student(StudId, ResearchArea, Advisor)
v1.professor(Name, Leads_DeptId)
v1.faculty(Name)
v1.course(CourseId, Name, RequiredExam_CourseId, CurriculumName)
v1.teaches(Name, CourseId)
v1.exam_taken(StudId, CourseId, Date, Grade)
v1.department(DeptId, Name, Address)
v1.works_in(FacName, DeptId)

All students are either masters or PhD students. Full professors are managed
separately from other faculty (e.g. research or teaching assistants). Each professor
leads a department. Faculty may work in possibly several departments. Full
names of professors and other faculty are assumed to be unique in the combined
domain of such names (we intentionally outline a less-than-perfect schema).
Courses are taught by professors or other faculty, have an id number, and may
require up to one other course for which students must have successfully passed
the exam in order to be admitted. If a course has no such requirement, a NULL
value is stored for the attribute RequiredExam_CourseId, rather than a course id.
PhD students have a professor as their advisor and an assigned area of research.
Masters students are either in their first or second period of their studies, and
this state is stored as a flag second_period.
Let us now have the following physical storage structures, which are defined
as views over the logical schema.

m1(StudId, StudName, Area, Advisor, DeptId, DeptName, DeptAddress) ←
v1.student(StudId, StudName),
v1.phd_student(StudId, Area, Advisor),
v1.works_in(StudName, DeptId),
v1.department(DeptId, DeptName, DeptAddress).

m2(Name, LeadsDeptId, DeptName, DeptAddress) ←
v1.professor(Name, LeadsDeptId),
v1.department(LeadsDeptId, DeptName, DeptAddress).
m3(StudId, StudName, CourseId) ←
v1.student(StudId, StudName),
v1.course(CourseId, CourseName, Req, Curriculum),
v1.exam_taken(StudId, Req, Date, Grade).

Now consider the following query, which asks for the names of PhD students who
work (e.g. as teaching assistants) in departments not led by their advisors.3

q(StudName) ← v1.professor(AdvisorName, LDeptId),
v1.department(LDeptId, LDeptName, LDeptAddress),
v1.student(StudId, StudName),
v1.phd_student(StudId, Area, AdvisorName),
v1.works_in(StudName, SDeptId),
v1.department(SDeptId, SDeptName, SDeptAddress),
LDeptAddress ≠ SDeptAddress.

In this context, materialized views – the physical storage structures – are
assumed complete and up-to-date. Thus, view m2, for instance, has the meaning

{⟨Name, LeadsDeptId, DeptName, DeptAddress⟩ |
m2(Name, LeadsDeptId, DeptName, DeptAddress)} ≡
{⟨Name, LeadsDeptId, DeptName, DeptAddress⟩ |
v1.professor(Name, LeadsDeptId) ∧
v1.department(LeadsDeptId, DeptName, DeptAddress)}

By solving the problem of answering queries using views, the following equiv-
alent rewriting of the input query can be found, in which all predicates of the
logical schema have been replaced by materialized views.
3 This query contains an inequality. Since the constraints (views) do not contain any in-
equalities, the query may be decomposed into

q(StudName) ← q′(StudName, LDeptAddress, SDeptAddress),
LDeptAddress ≠ SDeptAddress.

q′(StudName, LDeptAddress, SDeptAddress) ← v1.professor(AdvisorName, LDeptId),
v1.department(LDeptId, LDeptName, LDeptAddress),
v1.student(StudId, StudName),
v1.phd_student(StudId, Area, AdvisorName),
v1.works_in(StudName, SDeptId),
v1.department(SDeptId, SDeptName, SDeptAddress).

Thus, our algorithms from Chapter 5 are sufficient for finding maximally contained rewritings
of conjunctive queries with inequalities under sets of cind’s without inequalities. Of course, the
compositionality of conjunctive queries is preserved when inequalities are introduced. Thus,
the rewriting of q′ can be unfolded with q to obtain a maximally contained positive rewriting
with inequalities.

q(StudName) ←
m1(SId, StudName, A, ProfName, SDId, SDName, SDeptAddress),
m2(ProfName, LDId, LDName, LDeptAddress),
LDeptAddress ≠ SDeptAddress.

Note that the physical storage structures m1, m2 and m3 are not sufficient to
fully cover the logical schema; for instance, faculty other than PhD students are
not represented, and “teaches” relationships are not stored anywhere. Thus,
additional physical structures would be needed in practice.

It is easy to see that the problem of providing physical data independence is of
wide practical importance. Note that in [TSI94], each physical storage structure
is indexed over either a relation attribute or a row id (a relational equivalent
of object identifiers; since that work is presented in the light of a semantical data
model, the term object id is used there). The query rewriting problem
thus becomes the problem of answering queries using views with binding patterns,
as discussed in Section 3.6. Binding patterns, however, are considered in a weak
form – if no rewriting can be produced, binding patterns are ignored (equivalent
to ignoring an index and scanning the whole relation or materialized view).

7.1.2 Versions of Logical Schemata
Let us now assume that logical schemata may evolve. For several reasons, it may
be desirable not to rebuild storage structures each time a schema evolves.

• Physical storage structures (currently) need to be designed manually to
optimize performance.4 This requires expert work, which often is not
justified for a minor schema change that does not greatly affect the appro-
priateness of the current physical storage structures.
4 According to [Lev00], this is an important area of future database research.

• Materialized views may be very large and accessed rarely, so that
the cost of rebuilding physical structures relative to the cost of accessing
them cannot be assumed to be negligible. This is, for instance, the case in very
large (terabyte- or petabyte-scale) scientific repositories that are written only
once – to tertiary storage, e.g. tape robots – and where individual data records
are subsequently accessed only very sparingly. In that case it is worthwhile
to leave physical structures unchanged whenever possible and define new
versions of logical schemata relative to existing logical schemata versions as
well as the physical structures.

• Sometimes, data in physical storage structures must not be lost even when
new logical schema versions no longer make use of them. Reasons for this
may be that a database may still be addressed under the old logical schema
by certain applications, or that future schema versions may be expected to
make use of these data again.

Figure 7.3: EER diagram of the university domain (second version).


• Physical storage structures may be read-only, or replicas of databases that
are offline (e.g. in mobile, distributed applications).

Given that concepts in different schema versions may experience a true shift
of meaning (concept mismatch), cind’s are appropriate for encoding such
inter-schema dependencies. We next give an example showing why query
rewriting with cind’s may be relevant in this context. A number of serious
problems are left open, however; they are briefly summarized after this example,
at the end of the section. The main assumption that we make is that queries over
the logical schema may be translated into maximally contained positive queries
(rather than equivalent conjunctive queries5) over the storage structures.
5 Note that a query equivalent to a conjunctive query under a set of cind’s must itself be a
conjunctive query.

Example 7.1.2 Let us now define the following alterations to the logical schema
v1 of Example 7.1.1. Professors are now members of the faculty. The university
changes from a pure graduate school to also accommodate undergraduate stu-
dents. Both masters and PhD students are replaced by a new category, graduate
students. The two periods of masters studies cease to exist, but there is a new
field, “major”, for undergraduates. PhD research areas are represented by a log-
ical relation research_interest, which is also used for managing the research areas
of faculty. There is a new relation phd_program, which has its own key, referenced
by a new advises relationship with a professor. Not every professor leads a de-
partment anymore, so there is a new relation leads. The schema is again shown
as an EER diagram in Figure 7.3.

v2.student(StudId, Name)
v2.undergraduate_student(StudId, Major)
v2.graduate_student(StudId)
v2.phd_program(Id, StudId)
v2.research_interest(Name, Area)
v2.advises(Advisor, PhdProgramId)
v2.faculty(Name)
v2.professor(Name)
v2.course(CourseId, Name, RequiredExam_CourseId, CurriculumName)
v2.teaches(Name, CourseId)
v2.exam_taken(StudId, CourseId, Date, Grade)
v2.department(DeptId, Name, Address)
v2.leads(ProfName, DeptId)
v2.works_in(FacName, DeptId)

We define the following cind’s and leave cind’s that map predicates whose
meanings do not change from v1 to v2 as a (very) simple exercise for the reader.

{⟨StudId, StudName, ProfName, Area⟩ | ∃PhDProgramId :
v2.student(StudId, StudName) ∧ v2.graduate_student(StudId) ∧
v2.research_interest(StudName, Area) ∧
v2.advises(ProfName, PhDProgramId) ∧
v2.phd_program(PhDProgramId, StudId)} ⊇
{⟨StudId, StudName, Advisor, Area⟩ |
v1.phd_student(StudId, Area, Advisor) ∧ v1.student(StudId, StudName)}

{⟨Name, DeptId⟩ | v2.professor(Name) ∧ v2.faculty(Name) ∧
v2.works_in(Name, DeptId) ∧ v2.leads(Name, DeptId)} ⊇
{⟨Name, Leads_DeptId⟩ | v1.professor(Name, Leads_DeptId)}

v2.graduate_student(StudId) ← v1.masters_student(StudId, SecondPeriod).

With this second version of the logical schema, it is also necessary to define
additional physical storage structures to accommodate new data such as under-
graduate majors:

m4(StudId, StudName, Major) ←
v2.student(StudId, StudName),
v2.undergraduate_student(StudId, Major).

A subsequent third version of the logical schema could be defined using cind’s
relating to predicates of the previous versions as well as the physical storage
structures. 

As mentioned, we have left a number of important aspects of the problem of
maintaining physical data independence under schema evolution out of consider-
ation. In the context of this problem, query rewriting usually aims at producing
equivalent rather than maximally contained rewritings. If no equivalent one can
be found, no rewriting at all is produced. Rewritings over physical storage struc-
tures are usually assumed to return the same results as the original queries over
the logical schema. The problem of finding equivalent rewritings with cind’s,
however, entails cyclic sets of such constraints, for which we know that neither
maximally contained nor equivalent rewritings can be computed in general.
There are two pragmatic solutions to this problem, apart from the obvious
one of searching for an equivalent rewriting up to a time or memory consumption
threshold. Firstly, one could define maximal containment as the “correct”
semantics. That way, results will be complete in the case that an equivalent
rewriting exists, and logically still justified otherwise.6
Alternatively, one could first compute the maximally contained rewriting of a
query (over an acyclic set of cind’s composed of containment rather than equiv-
alence constraints) and then reverse the containment relationships in the cind’s
and test whether any of the conjunctive queries in the maximally contained
rewriting contains the input query. This would be a sound but theoretically
incomplete approach to producing equivalent rewritings. In practice, however, it
would probably coincide well with users’ expectations. Note that this requires
that each cind in the constraint base individually expresses an equivalence rela-
tionship; positive queries such as the one seen in the above example (e.g. the
disjoint partition of PhD and masters students) cannot be expressed.7
Another problem is related to propagating updates that are stated in terms
of the logical schema into the appropriate storage structures. In the classical ap-
proach to maintaining physical data independence, where physical storage struc-
tures are defined as views over the logical schema, updating these structures is
simple, as it reduces to refreshing these materialized views. Under our
problem definition, however, a generalized version of the much more involved
problem of updating views has to be faced [BS81, FC85, AHV95].
6 Certainly, design flaws in the physical storage structures – which do not permit inserting
data or answering certain queries although this should be possible from the point of view of the
logical schema – are harder to debug if maximally contained rewritings still return nonempty
results in cases where no equivalent rewritings exist.
7 This would require a major change of framework.

Finally, an issue that we have left out of consideration is that it may be useful
to have storage structures defined using binding patterns (that are, however,
weak in the sense that if no rewriting can be found that obeys them, the best
such rewriting – according to some cost metrics – that can be found should be
chosen). That way, indexes are special cases of such storage structures where
index keys are defined as bound [TSI94].
An interesting technique for obtaining equivalent rewritings with cind’s has
not been discussed so far. It is based on the idea of reversing the process of
computing the rewritings. In the method for computing equivalent rewritings
proposed in Chapter 5, one first attempts to obtain a contained rewriting and
then to prove it equivalent. Alternatively, one could try to obtain a subsuming
rewriting first and subsequently prove it to be contained in the input query.
This is done as follows. Let Q be the conjunctive input query and C the set of
Horn clauses obtained by normalizing the cind’s. First, Q is frozen into a canonical
database I in the tradition of Example 3.6.1. Next, the consequences of the logic
program I ∪ C (where I is taken as a set of facts) are determined by bottom-
up computation. If this computation reaches a fixpoint and an equivalent rewriting
exists at all, then such a rewriting is among the queries that can be constructed
– by undoing the freezing process8 – from the frozen head of Q and subsets of the
facts over source predicates in the fixpoint of the bottom-up computation. An
equivalent rewriting can then be determined by another bottom-up derivation
(this time in the “opposite” direction), as described in Example 5.3.1.

Example 7.1.3 Consider the query q(x, y) ← a(x, y). and the cind

{⟨x, y⟩ | a(x, y)} ≡ {⟨x, z⟩ | ∃y : b(x, y), c(y, z)}

and the source schema S = {b, c}. We freeze q into the facts base {a(αx, αy)}
and combine it with the three Horn clauses that result from the normalization of
the above cind. Bottom-up derivation results in the fixpoint

{a(αx, αy), b(αx, f(αx, αy)), c(f(αx, αy), αy)}

Only one query which satisfies the safety requirement can be constructed from
the head of q and a subset of the fixpoint over predicates in S, which is

q′(x, y) ← b(x, z), c(z, y).

(z is the variable which replaces the function term f(αx, αy).) Thus, q′ ⊇ q.
By freezing q′, combining the canonical database obtained with our Horn clauses,
and refuting the body of q bottom-up, we discover that q′ ⊆ q. Thus, q′ is an
equivalent rewriting of q.
8 That is, variables frozen into constants are again replaced by new variables, and so are
function terms.
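The final containment step of this technique can be sketched as follows. The fragment below is illustrative only (not the CindRew implementation): given the facts derived bottom-up from the frozen rewriting q′, it checks whether the body of the input query q maps homomorphically into these facts such that the head of q is mapped onto the frozen head of q′. All terms are assumed to be variables, and the bottom-up saturation itself is assumed to have been performed already.

    # queries are pairs (head_variables, body_atoms); an atom is (predicate, args)
    def maps_into(query, facts, frozen_head):
        head, body = query

        def extend(mapping, var, value):
            if mapping.get(var, value) != value:
                return None                      # conflicting binding
            updated = dict(mapping)
            updated[var] = value
            return updated

        def search(atoms, mapping):
            if not atoms:
                return [mapping.get(v) for v in head] == list(frozen_head)
            (pred, args), rest = atoms[0], atoms[1:]
            for fact_pred, fact_args in facts:
                if fact_pred != pred or len(fact_args) != len(args):
                    continue
                m = mapping
                for a, fa in zip(args, fact_args):
                    m = extend(m, a, fa)
                    if m is None:
                        break
                if m is not None and search(rest, m):
                    return True
            return False

        return search(list(body), {})

    # In Example 7.1.3, freezing q'(x, y) <- b(x, z), c(z, y) yields the facts
    # b(X, Z) and c(Z, Y); the normalized cind adds a(X, Y) by bottom-up derivation.
    # maps_into((["x", "y"], [("a", ["x", "y"])]),
    #           [("b", ("X", "Z")), ("c", ("Z", "Y")), ("a", ("X", "Y"))],
    #           ("X", "Y"))       # returns True, establishing q' <= q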

If we can guarantee for a restricted class of queries and cind’s that fixpoints
are always reached in bottom-up derivations, we have a complete algorithm
for computing equivalent rewritings that is guaranteed to terminate. One such
class is obtained by requiring all queries (both input queries and the subsumer
and subsumed queries in cind’s) to be typed conjunctive queries9 (see e.g. [AHV95]).
For constants appearing in queries, one has to require an analogous typedness
property. For instance, note that in the boolean query q ← a(1, 1). the two
constants must be assumed to be from different domains and thus different (as
their attributes are of different types).
Furthermore, one has to require that attributes added or removed between two
consecutive schema versions be consistently existentially quantified through-
out all of their appearances in cind’s between these two schema versions. This
requirement is in general too restrictive for data integration, but mirrors quite
closely the natural semantics of schema evolution.
9 Typed conjunctive queries follow the named perspective of relational algebra, i.e., each
attribute of a relation has a name unique inside the relation. Typed conjunctive queries are
only allowed to contain equijoins, i.e., only joins between relations by attributes with the same
name.

7.2 Rewriting Recursive Queries
We have shown earlier that query rewriting with cyclic sets of cind’s is
undecidable. The case of finding a maximally contained rewriting of a recursive
(datalog) query with respect to an acyclic set of cind’s, on the other hand, can be
solved in a straightforward way. The result is again a recursive datalog program.
We use the technique of [DG97], which was originally defined for the problem of
answering recursive queries using views, in a minor generalization (we work with
an acyclic set of cind’s rather than a single flat “layer” of views). We use the
fact that for acyclic sets of cind’s, function terms cannot grow beyond a certain
finite depth during bottom-up derivation starting from the database. This depth
is bounded by the total number of function symbols available. There is a unique
finite set of all those Horn clauses whose head predicates appear in the recursive
query to be rewritten and that only have subgoals that are materialized “source”
predicates for which data are available.10
10 This set is computed by Algorithm 5.3.3 if we omit the part that tries to rewrite the input
query with the unfolded Horn clauses that have been computed.
Let us, however, first take the perspective of query answering by bottom-up
derivation, considering the combination of a set of (acyclic) cind’s and a recursive
query as a logic program. Clearly, large intermediate results are created (which
are constructed using function terms) that we want to avoid for efficiency reasons.

Example 7.2.1 Let there be the recursive query
Figure 7.4: Fixpoint of the bottom-up derivation of Example 7.2.1.

q(x, y) ← e(x, y).      q(x, z) ← e(x, y), q(y, z).

which computes the transitive closure of the graph

⟨V = {v1 | ∃v2 : e(v1, v2)} ∪ {v2 | ∃v1 : e(v1, v2)}, E = e⟩

and the cind’s

Σ = { {⟨x, z⟩ | ∃y : e(x, y) ∧ e(y, z)} ⊇ {⟨x, z⟩ | t(x, z)},
      {⟨u, w⟩ | ∃v : t(u, v) ∧ t(v, w)} ⊇ {⟨u, w⟩ | s(u, w)} }

where t logically represents chains of two edges and s is a source of chains of
four edges. Assume now that we have the database I = {s(α, β), s(β, γ)}, where
α, β, γ are constants, the nodes of our graph. By transforming Σ into normal form
and performing bottom-up derivation, we obtain the fixpoint shown as a directed
graph in Figure 7.4. There is a tuple in q for each arc in the graph.11 Those arcs
that only belong to q are drawn as dotted lines. The result of the query is the
set of arcs between non-function term nodes, i.e. {⟨α, β⟩, ⟨β, γ⟩, ⟨α, γ⟩}.
11 To save the figure from overload, the “q” arcs are not named, unlike the other arcs.
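For illustration, this fixpoint can be reproduced by a few lines of naive bottom-up evaluation. The sketch below is illustrative only; it represents the Skolem function terms fv and fy as nested tuples and uses the strings "a", "b", "c" for the constants α, β, γ.

    def derive(s_facts):
        t, e = set(), set()
        # normalized cind 2:  t(x, f_v(x, z)) <- s(x, z)   and   t(f_v(x, z), z) <- s(x, z)
        for x, z in s_facts:
            fv = ("f_v", x, z)
            t |= {(x, fv), (fv, z)}
        # normalized cind 1:  e(x, f_y(x, z)) <- t(x, z)   and   e(f_y(x, z), z) <- t(x, z)
        for x, z in t:
            fy = ("f_y", x, z)
            e |= {(x, fy), (fy, z)}
        # the recursive query: q is the transitive closure of e (naive iteration)
        q = set(e)
        while True:
            new = {(x, z) for (x, y1) in e for (y2, z) in q if y1 == y2} - q
            if not new:
                return q
            q |= new

    answers = {(x, y) for (x, y) in derive({("a", "b"), ("b", "c")})
               if isinstance(x, str) and isinstance(y, str)}
    print(answers)      # {('a', 'b'), ('b', 'c'), ('a', 'c')} -- cf. Figure 7.4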

It is possible to rewrite the cind’s and the query into a single datalog program
such that no function terms have to be introduced during query execution.
This method is a straightforward generalization of the algorithm in [DG97] to
Horn clauses that are the unfoldings of the normalized acyclic cind’s, using Algo-
rithm 5.3.3.

Example 7.2.2 Consider again q and Σ of the previous example. The unfolding
of the normal form of Σ relative to the only EDB predicate e of the query is

e(x, fy(x, fv(x, y))) ← s(x, y).      e(fy(x, fv(x, y)), fv(x, y)) ← s(x, y).
e(fv(x, y), fy(fv(x, y), y)) ← s(x, y).      e(fy(fv(x, y), y), y) ← s(x, y).

We transform these into

e⟨1,fy(2,fv(3,4))⟩(x, x, x, y) ← s(x, y).
e⟨fy(1,fv(2,3)),fv(4,5)⟩(x, x, y, x, y) ← s(x, y).
e⟨fv(1,2),fy(fv(3,4),5)⟩(x, y, x, y, y) ← s(x, y).
e⟨fy(fv(1,2),3),4⟩(x, y, y, y) ← s(x, y).

where the structure of the function terms produced is moved into the predicate
names (e.g. e⟨1,fy(2,fv(3,4))⟩); the integers denote the index of the variable or
constant in the head atom that corresponds to the position in the function term.
The query is now transformed bottom-up, across possibly several iterations. The
result of the first iteration is

q⟨1,fy(2,fv(3,4))⟩(x1, x2, x3, x4) ← e⟨1,fy(2,fv(3,4))⟩(x1, x2, x3, x4).
q⟨fy(1,fv(2,3)),fv(4,5)⟩(x1, x2, x3, x4, x5) ← e⟨fy(1,fv(2,3)),fv(4,5)⟩(x1, x2, x3, x4, x5).
q⟨fv(1,2),fy(fv(3,4),5)⟩(x1, x2, x3, x4, x5) ← e⟨fv(1,2),fy(fv(3,4),5)⟩(x1, x2, x3, x4, x5).
q⟨fy(fv(1,2),3),4⟩(x1, x2, x3, x4) ← e⟨fy(fv(1,2),3),4⟩(x1, x2, x3, x4).

for the first rule of q and

q⟨fy(fv(1,2),3),fy(4,fv(5,6))⟩(x1, x2, x3, x5, x6, x7) ←
e⟨fy(fv(1,2),3),4⟩(x1, x2, x3, x4),
q⟨1,fy(2,fv(3,4))⟩(x4, x5, x6, x7).
q⟨1,fv(2,3)⟩(x1, x5, x6) ←
e⟨1,fy(2,fv(3,4))⟩(x1, x2, x3, x4),
q⟨fy(1,fv(2,3)),fv(4,5)⟩(x2, x3, x4, x5, x6).
q⟨fy(1,fv(2,3)),fy(fv(4,5),6)⟩(x1, x2, x3, x6, x7, x8) ←
e⟨fy(1,fv(2,3)),fv(4,5)⟩(x1, x2, x3, x4, x5),
q⟨fv(1,2),fy(fv(3,4),5)⟩(x4, x5, x6, x7, x8).
q⟨fv(1,2),3⟩(x1, x2, x6) ←
e⟨fv(1,2),fy(fv(3,4),5)⟩(x1, x2, x3, x4, x5),
q⟨fy(fv(1,2),3),4⟩(x3, x4, x5, x6).

for the second rule. The latter four rules combine the four function-free rewrit-
ings of the unfolded Horn clauses with the rewritings of the first rule of q. In
the subsequent iterations, the results of the previous iterations are combined. It
would consume too much space to write down the full rewriting, which contains
8 more rules for q. A single one of them is the new query goal,

q⟨1,2⟩(x1, x5) ← e⟨1,fy(2,fv(3,4))⟩(x1, x2, x3, x4), q⟨fy(1,fv(2,3)),4⟩(x2, x3, x4, x5).

Clearly, a number of optimizations over this naive transformation are possible,12
for which we refer to [DG97].

This transformation can easily be automated and is applicable to all datalog
queries.

12 After all, this query is equivalent to {q(x, y) ← s(x, y). q(x, z) ← s(x, y), q(y, z).} Note
that only four of the q... predicates (q⟨1,2⟩, q⟨fy(1,fv(2,3)),4⟩, q⟨fv(1,2),3⟩, q⟨fy(fv(1,2),3),4⟩) created
using this naive transformation are – taking a top-down perspective – reachable from the goal
predicate q⟨1,2⟩, and rules containing others may be eliminated outright.
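A sketch of the core of this automation – moving the structure of function terms into predicate names as in Example 7.2.2 – might look as follows. It is illustrative only; terms are variables given as strings or function terms given as (symbol, argument list) pairs, and the '<...>' in the generated names stands for the angle brackets of the annotated predicates used above.

    def flatten_term(term, counter, skeleton, args):
        # replace each leaf of `term` by its positional index, collecting leaves in `args`
        if isinstance(term, str):
            counter[0] += 1
            args.append(term)
            skeleton.append(str(counter[0]))
        else:
            symbol, subterms = term
            parts = []
            for sub in subterms:
                sub_skeleton = []
                flatten_term(sub, counter, sub_skeleton, args)
                parts.append("".join(sub_skeleton))
            skeleton.append(f"{symbol}({','.join(parts)})")

    def flatten_atom(pred, terms):
        # return the annotated predicate name and the flattened argument list
        counter, args, skeletons = [0], [], []
        for term in terms:
            skeleton = []
            flatten_term(term, counter, skeleton, args)
            skeletons.append("".join(skeleton))
        return f"{pred}<{','.join(skeletons)}>", args

    # e(x, f_y(x, f_v(x, y))) becomes e<1,f_y(2,f_v(3,4))>(x, x, x, y):
    print(flatten_atom("e", ["x", ("f_y", ["x", ("f_v", ["x", "y"])])]))

Rules such as those of Example 7.2.2 are then obtained by flattening, in this fashion, each atom of the unfolded Horn clauses and of the rewritten query rules.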
Chapter 8

Conclusions

The approach to data integration that has been proposed in this thesis has the
following features:

• The infrastructure does not rely on a “global” integration schema as un-
der LAV. Rather, several information systems may each need access to
integrated data from other information systems. Integration schemata may
lack sophistication or even any special preparation for source integration.

• Integration schemata may contain both materialized database relations and
purely logical predicates, for which data have to be provided by means of
data integration.

• Our approach provides good support for the creation and maintenance of
mappings between information systems under frequent change. This in-
cludes good decoupling of information systems through the mappings used
for integration, such that the workload imposed on the knowledge engineer
who maintains mappings when change occurs is as small as possible. The
approach at the same time permits mappings to be designed in a reasonably
natural way, thus simplifying the modeling and mapping work while enabling
the designer to express intuitions that may be useful for anticipating future
change.

• The data integration reasoning is carried out globally, declaratively, and
uses an intuitive and accessible semantics. Mappings between several in-
formation systems are transitive, which reduces the amount of redundant
mapping work that has to be done.

• Conjunctive inclusion dependencies as inter-schema constraints make it
possible to deal with concept mismatch in a wide sense. This is a necessary
condition for being able to deal with autonomous and changing integration
schemata.


We have pointed out that data integration with multiple unsophisticated,
evolving integration schemata is a problem of high relevance1 that has been insuf-
ficiently addressed so far. None of the previous work seems to be directly suitable.
Apart from management problems with respect to schemata and mappings sim-
ilar to those known from federated and multidatabases, we are confronted with
kinds of schema mismatch that require very expressive inter-schema constraints.
We have presented an approach based on model management and query
rewriting with expressive constraints and have discussed an architecture (Chap-
ter 4), model management operations (Chapter 6), and the issue of query rewrit-
ing (Chapter 5), a problem at the core of data integration. We have argued that
our approach supports the maintenance of the integration infrastructure by al-
lowing the modeling of mappings in a natural way and the decoupling of schemata
and mappings such that maintenance under change is simplified.
The practical feasibility of our approach has been in part shown by the imple-
mentation of the CindRew system based on the results of this thesis, and by the
benchmarks of Section 5.5. For the other part – model management – our presen-
tation was based on elementary intuitions of managing large systems that have
been widely verified and have permeated mainstream computer science thinking.
Much recent work in data integration has focussed either on procedural or
on highly structured declarative approaches meant to combine sufficient expres-
sive power with decidability (which we cannot guarantee for our approach in its
most general form). We have taken another direction, encoding a highly intuitive
class of constraints2 and providing theoretical results and an implementation for
sound best-effort query rewriting, with the intuition that practical data integra-
tion problems will often be completely solved. We have also discussed a very
important class (acyclic sets of constraints) for which we can guarantee com-
pleteness. We believe this work may be of quite immediate practical usefulness.
Plenty of material for further research has been provided in Chapter 7. A
successor project to the research that led to this thesis could be an effort to de-
velop an integrated model management and query rewriting system based on the
results presented here, but built on an object-oriented data model. Such
a system could be of immediate usefulness to scientific communities such as the
high energy physics community. Our query rewriting approach has an acceptability
advantage compared to other data integration approaches applicable to the set-
ting of large scientific collaborations (see Section 1.3). This is particularly true
when it comes to data integration on the Grid [FK98], with its extremely large
data volumes. Having stated this, we deem this work also a practical success,
with a clear benefit to the host of this PhD program, CERN.
1 The relevance of this work has been sufficiently argued for in Section 1.3 and Section 1.5,
and we will not reiterate this here.
2 We use conjunctive queries both in constraints and as targets for rewriting. When put into a
syntax such as select-from-where queries or tableau queries [Ull88, Ull89, AHV95], conjunctive
queries can be mastered by many non-expert users.
Bibliography

[AAA+ 97] J.L. Ambite, Y. Arens, N. Ashish, C.A. Knoblock, S. Minton,
J. Modi, M. Muslea, A. Philpot, W. Shen, S. Tejada, and W. Zhang.
“The SIMS Manual: Version 2.0. Working Draft”, December 1997.
[AB88] Serge Abiteboul and Catriel Beeri. “On the Power of Languages for
the Manipulation of Complex Objects”. Technical Report TR 846,
INRIA, 1988.
[ABD+ 96] Daniel E. Atkins, William P. Birmingham, Edmund H. Durfee,
Eric J. Glover, Tracy Mullen, Elke A. Rundensteiner, Elliot Soloway,
José M. Vidal, Raven Wallace, and Michael P. Wellman. “Toward
Inquiry-Based Education Through Interacting Software Agents”.
IEEE Computer, 29(5):69–76, May 1996.
[Abi97] Serge Abiteboul. “Querying Semistructured Data”. In Proc.
ICDT’97, Delphi, Greece, 1997.
[ABS00] Serge Abiteboul, Peter Buneman, and Dan Suciu. Data on the Web.
Morgan Kaufmann Publishers, 2000.
[ABU79] Alfred V. Aho, Catriel Beeri, and Jeffrey D. Ullman. “The Theory
of Joins in Relational Databases”. ACM Transactions on Database
Systems, 4(3):297–314, 1979.
[ACPS96] S. Adali, K. S. Candan, Y. Papakonstantinou, and V. S. Subrahma-
nian. “Query Caching and Optimization in Distributed Mediator
Systems”. In Proceedings of the 1996 ACM SIGMOD International
Conference on Management of Data (SIGMOD’96), pages 137–146,
Montreal, Canada, June 1996.
[AD98] Serge Abiteboul and Oliver M. Duschka. “Complexity of Answering
Queries Using Materialized Views”. In Proceedings of the ACM
SIGACT-SIGMOD-SIGART Symposium on Principles of Database
Systems (PODS) 1998, pages 254–263, 1998.
[Age] UMBC Agents Mailing List Archive
http://agents.umbc.edu/agentslist/archive/.

[AHV95] Serge Abiteboul, Richard Hull, and Victor Vianu. Foundations of
Databases. Addison-Wesley, 1995.

[AK92] Yigal Arens and Craig A. Knoblock. “Planning and Reformulating
Queries for Semantically-Modeled Multidatabase Systems”. In Pro-
ceedings of the First International Conference on Information and
Knowledge Management (CIKM’92), Baltimore, MD, 1992.

[AQM+ 97] Serge Abiteboul, Dallan Quass, Jason McHugh, Jennifer Widom,
and Janet L. Wiener. “The Lorel Query Language for Semistruc-
tured Data”. International Journal on Digital Libraries, 1(1):68–88,
1997.

[AS99] Albert Alderson and Hanifa Shah. “Viewpoints on Legacy Sys-
tems”. Communications of the ACM, 42(3):115–116, 1999.

[AV97a] Serge Abiteboul and Victor Vianu. “Queries and Computation on
the Web”. In Proc. ICDT’97, 1997.

[AV97b] Serge Abiteboul and Victor Vianu. “Regular Path Queries with
Constraints”. In Proceedings of the ACM SIGACT-SIGMOD-
SIGART Symposium on Principles of Database Systems, May 11–
15, 1997, Tucson, AZ USA, 1997.

[BB99] Philip A. Bernstein and Thomas Bergstraesser. “Meta-Data Sup-
port for Data Transformations Using Microsoft Repository”. IEEE
Data Engineering Bulletin, 22(1):9–14, March 1999.

[BBB+ 97] Roberto J. Bayardo Jr., William Bohrer, Richard S. Brice, Andrzej
Cichocki, Jerry Fowler, Abdelsalam Helal, Vipul Kashyap, Tomasz
Ksiezyk, Gale Martin, Marian H. Nodine, Mosfeq Rashid, Marek
Rusinkiewicz, Ray Shea, C. Unnikrishnan, Amy Unruh, and Darrell
Woelk. “InfoSleuth: Agent-Based Semantic Integration of Informa-
tion in Open and Dynamic Environments”. In J. Peckham, editor,
Proceedings of the 1997 ACM SIGMOD International Conference
on Management of Data (SIGMOD’97), pages 195–206, Tucson,
Arizona, USA, May 1997. ACM Press.

[BBMR89a] Alexander Borgida, Ronald J. Brachman, Deborah L. McGuinness,
and Lori A. Resnick. “CLASSIC: A Structural Data Model for
Objects”. In Proceedings of the 1989 ACM SIGMOD International
Conference on Management of Data (SIGMOD’89), pages 59–67,
June 1989.

[BBMR89b] Ronald J. Brachman, Alex Borgida, Deborah L. McGuinness, and
Lori A. Resnick. “The CLASSIC Knowledge Representation Sys-
tem, or, KL-ONE: The Next Generation”, February 1989.

[BD99] Alex Borgida and Prem Devanbu. “Adding more DL to IDL: To-
wards more Knowledgeable Component Inter-operability”. In Proc.
of ICSE’99, 1999.

[BDBW97] J. M. Bradshaw, S. Dutfield, P. Benoit, and J.D. Woolley. “KAoS:
Toward an Industrial-strength Open Agent Architecture”. In
Jeffrey M. Bradshaw, editor, Software Agents, pages 375–418.
AAAI/MIT Press, 1997.

[BDHS96] Peter Buneman, Susan Davidson, Gerd Hillebrand, and Dan Suciu.
“A Query Language and Optimization Techniques for Unstructured
Data”. In Proceedings of the 1996 ACM SIGMOD International
Conference on Management of Data (SIGMOD’96), 1996.

[BF97] Avrim L. Blum and Merrick L. Furst. “Fast Planning Through
Planning Graph Analysis”. Artificial Intelligence, 90:281–300, 1997.

[BH91] Franz Baader and Bernhard Hollunder. “KRIS: Knowledge Rep-
resentation and Inference System”. SIGART Bulletin, 2(3):8–14,
1991.

[BLN86] Carlo Batini, Maurizio Lenzerini, and Shamkant B. Navathe. “A
Comparative Analysis of Methodologies for Database Schema Inte-
gration”. ACM Computing Surveys, 18:323–364, 1986.

[BLP00] Philip A. Bernstein, Alon Y. Levy, and Rachel A. Pottinger. “A Vi-
sion for Management of Complex Models”. Technical Report 2000-
53, Microsoft Research, 2000.

[BLR97] Catriel Beeri, Alon Y. Levy, and Marie-Christine Rousset. “Rewrit-
ing Queries Using Views in Description Logics”. In Proceedings of
the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of
Database Systems, May 11–15, 1997, Tucson, AZ USA, pages 99–
108, 1997.

[BM93] Elisa Bertino and Lorenzo Martino. Object-oriented Database Sys-
tems - Concepts and Architectures. Addison-Wesley, 1993.

[Bor95] Alexander Borgida. “Description Logics in Data Management”.
IEEE Transactions on Knowledge and Data Engineering, 7(5):671–
682, October 1995.

[BPGL85] Ronald J. Brachman, V. Pigman Gilbert, and Hector J. Levesque.
“An Essential Hybrid Reasoning System: Knowledge and Symbol
Level Accounts in KRYPTON”. In Proceedings of the International
Joint Conference on Artificial Intelligence (IJCAI’85), pages 532–
539, 1985.

[BPS94] Alexander Borgida and Peter F. Patel-Schneider. “A Semantics and
Complete Algorithm for Subsumption in the CLASSIC Description
Logic”. Journal of Artificial Intelligence Research, 1:277–308, 1994.

[Bra83] Ronald J. Brachman. “What IS-A is and isn’t: An Analysis of
Taxonomic Links in Semantic Networks”. IEEE Computer, 16(10),
October 1983.

[BS81] F. Bancilhon and N. Spyratos. “Update Semantics of Relational
Views”. ACM Transactions on Database Systems, 6(4):557–575,
December 1981.

[BS85] Ronald J. Brachman and James G. Schmolze. “An Overview of the
KL-ONE Knowledge Representation System”. Cognitive Science,
9(2):171–216, 1985.

[CBB+ 97] R. G. G. Cattell, Douglas K. Barry, Mark Berler, Jeff Eastman,
David Jordan, Craig Russell, Olaf Schadow, Torsten Stanienda, and
Fernando Velez. The Object Database Standard: ODMG 2.0. Mor-
gan Kaufmann, 1997.

[CDL98a] Diego Calvanese, Giuseppe De Giacomo, and Maurizio Lenzerini.
“On the Decidability of Query Containment under Constraints”. In
Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium
on Principles of Database Systems (PODS) 1998, pages 149–158,
1998.

[CDL+ 98b] Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini,
Daniele Nardi, and Riccardo Rosati. “Information Integration: Con-
ceptual Modeling and Reasoning Support”. In Proc. CoopIS’98,
pages 280–291, 1998.

[CDL99] Diego Calvanese, Giuseppe De Giacomo, and Maurizio Lenzerini.
“Answering Queries using Views in Description Logics”. In Proc.
of the 1999 Description Logic Workshop (DL’99), CEUR Workshop
Proceedings, Vol. 22, pages 9–13, 1999.

[CDLV99] Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and
Moshe Y. Vardi. “Rewriting of Regular Expressions and Regular
Path Queries”. In Proceedings of the ACM SIGACT-SIGMOD-
SIGART Symposium on Principles of Database Systems (PODS)
1999, pages 194–204, Philadelphia, PA, 1999.

[CDLV00a] Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and
Moshe Y. Vardi. “Answering Regular Path Queries Using Views”.
In Proceedings of the IEEE International Conference on Data En-
gineering (ICDE 2000), pages 389–398, 2000.

[CDLV00b] Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and
Moshe Y. Vardi. “View-Based Query Processing for Regular Path
Queries with Inverse”. In Proceedings of the ACM SIGACT-
SIGMOD-SIGART Symposium on Principles of Database Systems
(PODS) 2000, pages 58–66, Dallas, TX, 2000.

[CH80] Ashok K. Chandra and David Harel. “Computable Queries for Re-
lational Data Bases”. Journal of Computer and System Sciences,
21(2):156–178, 1980.

[CH82] Ashok K. Chandra and David Harel. “Structure and Complexity
of Relational Queries”. Journal of Computer and System Sciences,
25(1):99–128, 1982.

[Cha88] Ashok K. Chandra. “Theory of Database Queries”. In Proceed-
ings of the 7th ACM SIGACT-SIGMOD-SIGART Symposium on
Principles of Database Systems (PODS’88), pages 1–9. ACM Press,
1988.

[Che76] Peter Pin-Shan Chen. “The Entity-Relationship Model – Toward a
Unified View of Data”. ACM Transactions on Database Systems,
1(1):9–36, March 1976.

[CHS+ 95] Michael J. Carey, Laura M. Haas, Peter M. Schwarz, Manish Arya,
William F. Cody, Ronald Fagin, Myron Flickner, Allen W. Lu-
niewski, Wayne Niblack, Dragutin Petkovic, John Thomas, John H.
Williams, and Edward L. Wimmers. “Towards Heterogeneous Mul-
timedia Information Systems: The Garlic Approach”. In Proceed-
ings of the Fifth International Workshop on Research Issues in Data
Engineering: Distributed Object Management (RIDE-DOM’95),
1995.

[CJ96] D. Cockburn and N. R. Jennings. “ARCHON: A Distributed Arti-
ficial Intelligence System for Industrial Applications”. In G. M. P.
O’Hare and N. R. Jennings, editors, Foundations of Distributed Ar-
tificial Intelligence, pages 319–344. Wiley, 1996.

[CKM91] Jaime G. Carbonell, Craig A. Knoblock, and Steven Minton.
“PRODIGY: An Integrated Architecture for Planning and Learn-
ing”. In Kurt VanLehn, editor, Architectures for Intelligence, pages
241–278. Lawrence Erlbaum, Hillsdale, NJ, 1991.

[CKPS95] S. Chaudhuri, R. Krishnamurthy, S. Potamianos, and K. Shim.
“Optimizing Queries with Materialized Views”. In Proceedings
of the 11th IEEE International Conference on Data Engineering
(ICDE’95), 1995.

[CKW89] Weidong Chen, Michael Kifer, and David S. Warren. “HiLog: A
Foundation for Higher-Order Logic Programming”. Technical re-
port, Dept. of CS, SUNY at Stony Brook, 1989.

[CM77] Ashok K. Chandra and Philip M. Merlin. “Optimal Implementation
of Conjunctive Queries in Relational Data Bases”. In Conference
Record of the Ninth Annual ACM Symposium on Theory of Com-
puting (STOC’77), pages 77–90, Boulder, Colorado, May 1977.

[CM90] Mariano P. Consens and Alberto O. Mendelzon. “GraphLog: a
Visual Formalism for Real Life Recursion”. In Proceedings of the
9th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of
Database Systems (PODS’90), 1990.

[CMS95] “CMS Technical Proposal”, January 1995.

[Cod70] E. F. Codd. “A Relational Model of Data for Large Shared Data
Banks”. Communications of the ACM, 13(6):377–387, June 1970.

[Coo] International Conferences on Cooperative Information Systems,
1996–2001.

[COZ00] P. Ciancarini, A. Omicini, and F. Zambonelli. “Multiagent System
Engineering: the Coordination Viewpoint”. In Intelligent Agents VI
- Proceedings of the 6th International Workshop on Agent Theories,
Architectures, and Languages (ATAL’99), LNAI Series, Vol. 1767.
Springer Verlag, February 2000.

[Cro94] Kevin Crowston. “A Taxonomy Of Organizational Dependencies
and Coordination Mechanisms”. Technical Report 174, MIT Centre
for Coordination Science, Cambridge, MA, 1994.

[CS93] Surajit Chaudhuri and Kyuseok Shim. “Query Optimization in the
Presence of Foreign Functions”. In Proceedings of the 19th Interna-
tional Conference on Very Large Data Bases (VLDB’93), Dublin,
Ireland, 1993.

[CTP00] Peter Clark, J. Thompson, and Bruce Porter. “Knowledge Pat-
terns”. In Proceedings of the International Conference on Principles
of Knowledge Representation and Reasoning (KR’2000), 2000.

[CV92] Surajit Chaudhuri and Moshe Y. Vardi. “On the Equivalence of Re-
cursive and Nonrecursive Datalog Programs”. In Proceedings of the
11th ACM SIGACT-SIGMOD-SIGART Symposium on Principles
of Database Systems (PODS’92), pages 55–66, 1992.

[CV94] Surajit Chaudhuri and Moshe Y. Vardi. “On the Complexity
of Equivalence between Recursive and Nonrecursive Datalog Pro-
grams”. In Proceedings of the ACM SIGACT-SIGMOD-SIGART
Symposium on Principles of Database Systems (PODS) 1994, pages
107–116, Minneapolis, Minnesota, May 1994.

[CV97] Surajit Chaudhuri and Moshe Y. Vardi. “On the Equivalence of Re-
cursive and Nonrecursive Datalog Programs”. Journal of Computer
and System Sciences, 54(1):61–78, 1997.

[Cyc] Cycorp. “Features of CycL”. http://www.cyc.com/cycl.html.

[Dec95] Keith Decker. “TAEMS: A Framework for Environment Centered
Analysis and Design of Coordination Mechanisms”. In G. O’Hare
and Nicholas Jennings, editors, Foundations of Distributed Artificial
Intelligence, chapter 16, pages 429–448. Wiley Inter-Science, 1995.

[DEGV] Evgeny Dantsin, Thomas Eiter, Georg Gottlob, and Andrei
Voronkov. “Complexity and Expressive Power of Logic Program-
ming”. To appear in ACM Computing Surveys.

[DG97] Oliver M. Duschka and Michael R. Genesereth. “Answering Recur-
sive Queries using Views”. In Proceedings of the ACM SIGACT-
SIGMOD-SIGART Symposium on Principles of Database Systems,
May 11–15, 1997, Tucson, AZ USA, 1997.

[DGL00] Oliver M. Duschka, Michael R. Genesereth, and Alon Y. Levy. “Re-
cursive Query Plans for Data Integration”. Journal of Logic Pro-
gramming, 43(1):49–73, 2000.

[DJ90] Nachum Dershowitz and Jean-Pierre Jouannaud. “Rewrite Sys-
tems”. In Jan van Leeuwen, editor, Handbook of Theoretical Com-
puter Science, volume 2, chapter 6, pages 243–320. Elsevier Science
Publishers B.V., 1990.

[DL91] Edmund H. Durfee and Victor R. Lesser. “Partial Global Planning:
A Coordination Framework for Distributed Hypothesis Formation”.
IEEE Transactions on Systems, Man, and Cybernetics (Special Is-
sue on Distributed Sensor Networks), 21(5):1167–1183, September
1991.

[DL92] Keith Decker and Victor Lesser. “Generalizing The Partial Global
Planning Algorithm”. International Journal on Intelligent Cooper-
ative Information Systems, 1(2):319–346, 1992.

[DL95] Keith Decker and Victor Lesser. “Designing a Family of Coordi-
nation Algorithms”. In Proceedings of the First International Con-
ference on Multiagent Systems (ICMAS’95), San Francisco, June
1995. AAAI Press.

[DL97a] Giuseppe De Giacomo and Maurizio Lenzerini. “A Uniform Frame-
work for Concept Definitions in Description Logics”. Journal of
Artificial Intelligence Research (JAIR), 6:87–110, 1997.

[DL97b] Oliver M. Duschka and Alon Y. Levy. “Recursive Plans for Infor-
mation Gathering”. In Proceedings of the 15th International Joint
Conference on Artificial Intelligence (IJCAI’97), Nagoya, Japan,
August 1997.

[DLNS96] Francesco Donini, Maurizio Lenzerini, Daniele Nardi, and Andrea
Schaerf. “Reasoning in Description Logics”. In G. Brewka, editor,
Principles of Knowledge Representation and Reasoning, Studies in
Logic, Language and Information, pages 193–238. CSLI Publica-
tions, 1996.

[DLNS98] Francesco M. Donini, Maurizio Lenzerini, Daniele Nardi, and An-
drea Schaerf. “AL-log: Integrating Datalog and Description Log-
ics”. Journal of Intelligent Information Systems, 10:227–252, 1998.

[DS83] R. Davis and R. G. Smith. “Negotiation as a Metaphor for Dis-
tributed Problem Solving”. Artificial Intelligence, 20(1):63–109,
January 1983.

[DSW97] Keith Decker, Katia Sycara, and Mike Williamson. “Middle-Agents
for the Internet”. In Proceedings of the 15th International Joint
Conference on Artificial Intelligence (IJCAI’97), Nagoya, Japan,
1997.

[DSW+ 99] A. J. Duineveld, R. Stoter, M. R. Weiden, B. Kenepa, and V. R.
Benjamins. “Wondertools? A Comparative Study of Ontological
Engineering Tools”. In Proc. Twelfth Workshop on Knowledge Ac-
quisition, Modeling and Management (KAW’99), Banff, Alberta,
Canada, October 1999.

[DV97] Evgeny Dantsin and Andrei Voronkov. “Complexity of Query An-
swering in Logic Databases with Complex Values”. In LFCS’97,
LNCS 1234, pages 56–66, 1997.
[Etz96] Oren Etzioni. “Moving Up the Information Food Chain: Deploying
Softbots on the World Wide Web”. In Proc. AAAI’96, 1996.
[FC85] A.L. Furtado and M.A. Casanova. “Updating Relational Views”. In
W. Kim, D.S. Reiner, and D.S. Batory, editors, Query Processing
in Database Systems. Springer-Verlag, Berlin, 1985.
[FFKL98] Mary Fernandez, Daniela Florescu, Jaewoo Kang, and Alon Levy.
“Catching the Boat with Strudel: Experiences with a Web-Site
Management System”. In Proceedings of the 1998 ACM SIGMOD
International Conference on Management of Data (SIGMOD’98),
pages 414–425, Seattle, WA, June 1998.
[FFMM94] T. Finin, R. Fritzson, D. McKay, and R. McEntire. “KQML as
an Agent Communication Language”. In Proceedings of the Third
International Conference on Information and Knowledge Manage-
ment (CIKM’94). ACM Press, November 1994.
[FFR96] A. Farquhar, R. Fikes, and J. Rice. “The Ontolingua Server: a Tool
for Collaborative Ontology Construction”. In B. Gaines, editor,
Proceedings of 10th Knowledge Acquisition for Knowledge-Based
Systems Workshop (KAW96), Banff, Canada, 1996.
[FK98] Ian Foster and Carl Kesselman, editors. The Grid: Blueprint for a
New Computing Infrastructure. Morgan Kaufmann Publishers, San
Francisco, July 1998.
[FL97] Tim Finin and Yannis Labrou. “A Proposal for a new KQML Spec-
ification”. Technical Report CS-97-03, Computer Science and Elec-
trical Engineering Department, University of Maryland Baltimore
County, Baltimore, MD 21250, February 1997.
[FLM98] Daniela Florescu, Alon Levy, and Alberto Mendelzon. “Database
Techniques for the World-Wide Web: A Survey”. SIGMOD Record,
27(3):59–74, 1998.
[FMU82] Ronald Fagin, Alberto O. Mendelzon, and Jeffrey D. Ullman. “A
Simplified Universal Relation Assumption and its Properties”. ACM
Transactions on Database Systems, 7(3):343–360, 1982.
[FN71] Richard Fikes and Nils J. Nilsson. “STRIPS: A New Approach to
the Application of Theorem Proving to Problem Solving”. Artificial
Intelligence, 2(3/4), 1971.

[FN00] Enrico Franconi and Gary Ng. “The ICOM Tool for Intelligent
Conceptual Modelling”. In Proc. 7th Intl. Workshop on Knowledge
Representation meets Databases (KRDB’00), Berlin, Germany, Au-
gust 2000.

[FNPB99] Jerry Fowler, Marian Nodine, Brad Perry, and Bruce Bargmeyer.
“Agent-based Semantic Interoperability in InfoSleuth”. SIGMOD
Record, 28(1):60–67, 1999.

[Fra99] Enrico Franconi, 1999. Description Logics Course Web Page. Avail-
able at http://www.cs.man.ac.uk/~franconi/dl/course/.

[FRV95] Daniela Florescu, Louiqa Raschid, and Patrick Valduriez. “Using
Heterogeneous Equivalences for Query Rewriting in Multidatabase
Systems”. In Proc. CoopIS’95, pages 158–169, 1995.

[FVR96] Daniela Florescu, Patrick Valduriez, and Louiqa Raschid. “An-
swering Queries Using OQL View Expressions”. In Workshop on
Materialized Views in Cooperation with ACM SIGMOD, 1996.

[GEW96] Keith Golden, Oren Etzioni, and Dan Weld. “Planning with Exe-
cution and Incomplete Information”. Technical Report UW-CSE-
96-01-09, Department of Computer Science and Engineering, Uni-
versity of Washington, Seattle, February 1996.

[GF92] Michael R. Genesereth and Richard E. Fikes. “Knowledge Inter-
change Format, Version 3.0 Reference Manual”. Technical Re-
port Logic-92-1, Computer Science Department, Stanford Univer-
sity, 1992.

[GG95] Nicola Guarino and Pierdaniele Giaretta. “Ontologies and Knowl-
edge Bases: Towards a Terminological Clarification”. In N. J. I.
Mars, editor, Towards Very Large Knowledge Bases. IOS Press,
1995.

[GHB99] Mark Greaves, Heather Holmback, and Jeffrey M. Bradshaw.
“What is a Conversation Policy?”. In Mark Greaves and Jeffrey M.
Bradshaw, editors, Proceedings of the Autonomous Agents’99 Work-
shop on Specifying and Implementing Conversation Policies, pages
1–9, Seattle, Washington, May 1999.

[GHJV94] E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design Patterns.
Elements of Reusable Object-Oriented Software. Addison Wesley
Professional Computing Series, October 1994.

[GJ79] Michael R. Garey and David S. Johnson. Computers and Intractabil-
ity: A Guide to the Theory of NP-Completeness. W.H. Freeman &
Co., 1979.

[GK94] Michael R. Genesereth and Steven P. Ketchpel. “Software Agents”.
Communications of the ACM, 37(7):48–53, 1994.

[GKD97] Michael R. Genesereth, Arthur M. Keller, and Oliver M. Duschka.
“Infomaster: An Information Integration System”. In Proceedings of
the 1997 ACM SIGMOD International Conference on Management
of Data (SIGMOD’97), pages 539–542, 1997.

[GMLY98] Hector Garcia-Molina, Wilburt Labio, and Jun Yang. “Expiring
Data in a Warehouse”. In Proceedings of the 1998 International
Conference on Very Large Data Bases (VLDB’98), 1998. Extended
version as Technical Report 1998-35, Stanford Database Group.

[GMPQ+ 97] Hector Garcia-Molina, Yannis Papakonstantinou, Dallan Quass,
Anand Rajaraman, Yehoshua Sagiv, Jeffrey D. Ullman, Vasilis Vas-
salos, and Jennifer Widom. “The TSIMMIS Approach to Mediation:
Data Models and Languages”. Journal of Intelligent Information
Systems, 8(2):117–132, 1997.

[GN87] Michael R. Genesereth and Nils J. Nilsson. Logical Foundations of
Artificial Intelligence. Morgan Kaufmann Publishers, 1987.

[Gru] Thomas R. Gruber. “What is an Ontology?”.
http://www-ksl.stanford.edu/kst/what-is-an-ontology.html.

[Gru92] Thomas R. Gruber. “Ontolingua: A Mechanism to Support
Portable Ontologies”. Technical Report KSL-91-66, Stanford Uni-
versity, Knowledge Systems Laboratory, March 1992.

[Gru93a] Thomas R. Gruber. “A Translation Approach to Portable Ontology
Specifications”. Technical Report KSL-92-71, Stanford University,
Knowledge Systems Laboratory, April 1993.

[Gru93b] Thomas R. Gruber. “Toward Principles for the Design of Ontolo-
gies Used for Knowledge Sharing”. Technical Report KSL 93-04,
Knowledge Systems Laboratory, Stanford University, 1993.

[Gua94] Nicola Guarino. “The Ontological Level”. In B. Smith, R. Casati,
and G. White, editors, Philosophy and the Cognitive Sciences, Vi-
enna. Hölder-Pichler-Tempsky, 1994. Invited paper presented at IV
Wittgenstein Symposium, Kirchberg, Austria, 1993.

[Gua97] Nicola Guarino. “Understanding, Building, and Using Ontologies.
A Commentary to ‘Using Explicit Ontologies in KBS Development’,
by van Heijst, Schreiber, and Wielinga”. International Journal of
Human and Computer Studies, 46(2/3):293–310, 1997.

[GW00a] Nicola Guarino and Christopher A. Welty. “Identity, Unity, and
Individuality: Towards a Formal Toolkit for Ontological Analysis”.
In Proceedings of the European Conference on Artificial Intelligence
(ECAI-2000). IOS Press, August 2000.

[GW00b] Nicola Guarino and Christopher A. Welty. “Ontological Analysis
of Taxonomic Relationships”. In International Conference on Con-
ceptual Modeling (ER 2000), pages 210–224, 2000.

[Hal00] Alon Y. Halevy. “Theory of Answering Queries Using Views”. Sig-
mod Record, 29(4), December 2000.

[HGB99] Heather Holmback, Mark Greaves, and Jeffrey Bradshaw. “Agent
A, Can You Pass the Salt? The Role of Pragmatics in Agent Com-
munications”, May 1999. Submitted to Autonomous Agents’99.

[HK93] Chun-Nan Hsu and Craig A. Knoblock. “Reformulating Query
Plans for Multidatabase Systems”. In Proc. of the Second Inter-
national Conference on Information and Knowledge Management
(CIKM’93), pages 423–432, Washington, DC, 1993.

[HM85] Dennis Heimbigner and Dennis McLeod. “A Federated Architec-
ture for Information Management”. ACM Transactions on Office
Information Systems, 3(3):253–278, July 1985.

[HM00] Volker Haarslev and Ralf Möller. “Expressive ABox Reasoning with
Number Restrictions, Role Hierarchies, and Transitively Closed
Roles”. In Fausto Giunchiglia and Bart Selman, editors, Proceed-
ings of Seventh International Conference on Principles of Knowl-
edge Representation and Reasoning (KR’2000), Breckenridge, Col-
orado, USA, April 2000.

[Hor98] Ian Horrocks. “Using an Expressive Description Logic: FaCT or
Fiction?”. In A. G. Cohn, L. Schubert, and S. C. Shapiro, editors,
Principles of Knowledge Representation and Reasoning: Proceed-
ings of the Sixth International Conference (KR’98), pages 636–647.
Morgan Kaufmann Publishers, June 1998.

[HS97] M. Huhns and M. P. Singh. “Ontologies for Agents”. E-commerce,
IEEE Internet Computing, 1(6):81–83, November–December 1997.

[HU79] John E. Hopcroft and Jeffrey D. Ullman. Introduction to Automata
Theory, Languages, and Computation. Addison-Wesley Publishing
Company, Reading, Massachusetts, 1979.

[JCL+ 96] N. R. Jennings, J. Corera, I. Laresgoiti, E. H. Mamdani, F. Per-
riolat, P. Skarek, and L. Z. Varga. “Using ARCHON to Develop
Real-world DAI Applications for Electricity Transportation Man-
agement and Particle Accelerator Control”. IEEE Expert, 11(6),
1996.

[Jen99] Nicholas R. Jennings. “Agent-based Computing: Promise and Per-
ils”. In Proceedings of the International Joint Conference on Ar-
tificial Intelligence (IJCAI’99), Stockholm, Sweden, 1999. Morgan
Kaufmann Publishers.

[JFJ+ 96] N. R. Jennings, P. Faratin, M. J. Johnson, T. J. Norman, P. O’Brien,
and M. E. Wiegand. “Agent-based Business Process Management”.
International Journal of Cooperative Information Systems, 5(2 and
3):105–130, 1996.

[JFN+ 00] N. R. Jennings, P. Faratin, T. J. Norman, P. O’Brien, B. Odgers,
and J. L. Alty. “Implementing a Business Process Management
System using ADEPT: A Real-World Case Study”. International
Journal of Applied Artificial Intelligence, 14(3), 2000.

[JGJ+ 95] M. Jarke, R. Gallersdörfer, M.A. Jeusfeld, M. Staudt, and S. Eherer.
“ConceptBase – A Deductive Object Base for Meta Data Manage-
ment”. Journal of Intelligent Information Systems, Special Issue
on Advances in Deductive Object-Oriented Databases, 4(2):167–192,
1995.

[JLVV00] Matthias Jarke, Maurizio Lenzerini, Yannis Vassiliou, and Panos
Vassiliadis. Fundamentals of Data Warehouses. Springer-Verlag,
2000.

[JNF98] N. R. Jennings, T. J. Norman, and P. Faratin. “ADEPT: An Agent-
based Approach to Business Process Management”. ACM SIGMOD
Record, 27(4):32–39, 1998.

[Joh90] David S. Johnson. “A Catalog of Complexity Classes”. In Jan
van Leeuwen, editor, Handbook of Theoretical Computer Science,
volume 1, chapter 2, pages 67–161. Elsevier Science Publishers B.V.,
1990.

[JW00] Nicholas R. Jennings and Michael Wooldridge. “Agent-Oriented
Software Engineering”. In Jeffrey Bradshaw, editor, Handbook of
Agent Technology. AAAI/MIT Press, 2000.

[Kan90] Paris C. Kanellakis. “Elements of Relational Database Theory”. In
Jan van Leeuwen, editor, Handbook of Theoretical Computer Sci-
ence, volume 2, chapter 17, pages 1074–1156. Elsevier Science Pub-
lishers B.V., 1990.

[KDB98] Anthony Kosky, Susan Davidson, and Peter Buneman. “Semantics
of Database Transformations”. In L. Libkin and B. Thalheim, edi-
tors, Semantics of Databases. Springer LNCS 1358, February 1998.

[Kim95] Won Kim, editor. Modern Database Systems: The Object Model,
Interoperability, and Beyond. Addison-Wesley, 1995.

[KJ99] S. Kalenka and N. R. Jennings. “Socially Responsible Decision Mak-
ing by Autonomous Agents”. In K. Korta, E. Sosa, and X. Arrazola,
editors, Cognition, Agency and Rationality, pages 135–149. Kluwer,
1999.

[KL89] Michael Kifer and Georg Lausen. “F-Logic: A Higher-Order Lan-
guage for Reasoning about Objects, Inheritance, and Scheme”. In
Proceedings of the 1989 ACM SIGMOD International Conference
on Management of Data (SIGMOD’89), pages 134–146, Portland,
OR USA, 1989.

[Klu88] Anthony Klug. “On Conjunctive Queries Containing Inequalities”.
Journal of the ACM, 35(1):146–160, January 1988.

[KS92] Henry Kautz and Bart Selman. “Planning as Satisfiability”. In Pro-
ceedings of the 10th European Conference on Artificial Intelligence
(ECAI’92), Vienna, August 1992.

[KW96] Chung T. Kwok and Daniel S. Weld. “Planning to Gather Informa-
tion”. In Proc. AAAI’96, Portland, OR, August 1996.

[Lev00] Alon Y. Levy. “Answering Queries Using Views: A Survey”, 2000.
Submitted for publication.

[LGP+ 90] D.B. Lenat, R.V. Guha, K. Pittman, D. Pratt, and M. Shepherd.
“Cyc: Toward Programs with Common Sense”. Communications
of the ACM, 33(8):30–49, 1990.

[LHC] http://lhc.web.cern.ch/lhc/.

[LMSS95] Alon Y. Levy, Alberto O. Mendelzon, Yehoshua Sagiv, and Divesh
Srivastava. “Answering Queries Using Views”. In Proceedings of
the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of
Database Systems (PODS) 1995, San Jose, CA, 1995.

[LR96] Alon Y. Levy and Marie-Christine Rousset. “CARIN: A Represen-
tation Language Combining Horn Rules and Description Logics”.
In Proc. 12th European Conference of Artificial Intelligence, 1996.

[LRO96] Alon Y. Levy, Anand Rajaraman, and Joann J. Ordille. “Query-
ing Heterogeneous Information Sources Using Source Descriptions”.
In Proceedings of the 1996 International Conference on Very Large
Data Bases (VLDB’96), pages 251–262, 1996.

[LRV88] Christophe Lécluse, Philippe Richard, and Fernando Velez. “O2,
an Object-oriented Data Model”. In Proceedings of the 1988 ACM
SIGMOD International Conference on Management of Data (SIG-
MOD’88), pages 424–433, Chicago, IL USA, June 1988.

[LS97] Alon Y. Levy and Dan Suciu. “Deciding Containment for Queries
with Complex Objects (Extended Abstract)”. In Proceedings of
the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of
Database Systems, May 11–15, 1997, Tucson, AZ USA, pages 20–
31, 1997.

[LSS99] Laks V. S. Lakshmanan, Fereidoon Sadri, and Subbu N. Subrah-
manian. “On Efficiently Implementing SchemaSQL on a SQL
Database System”. In Proceedings of the 25th International Confer-
ence on Very Large Data Bases (VLDB’99), Edinburgh, Scotland,
1999.

[Mae94] Pattie Maes. “Agents that Reduce Work and Information Over-
load”. Communications of the ACM, 37(7), July 1994.

[MB87] R. MacGregor and R. Bates. “The LOOM Knowledge Representa-
tion Language”. Technical Report ISI/RS-97-188, USC/ISI, 1987.

[MHH+ 01] R. Miller, M. Hernandez, L. Haas, L. Yan, C. Ho, R. Fagin, and
L. Popa. “The Clio Project: Managing Heterogeneity”. SIGMOD
Record, 30(1), March 2001.

[MIKS00] E. Mena, A. Illarramendi, V. Kashyap, and A. Sheth. “OBSERVER:
An Approach for Query Processing in Global Information Systems
based on Interoperation across Pre-existing Ontologies”. Inter-
national Journal of Distributed and Parallel Databases (DAPD),
8(2):223–271, 2000.

[MKSI96] Eduardo Mena, Vipul Kashyap, Amit P. Sheth, and Arantza Il-
larramendi. “OBSERVER: An Approach for Query Processing in
Global Information Systems based on Interoperation across Pre-
existing Ontologies”. In Proceedings First IFCIS International Con-
ference on Cooperative Information Systems (CoopIS’96), pages 14–
25, Brussels, Belgium, June 1996. IEEE Computer Society Press.

[MKW00] Prasenjit Mitra, Martin Kersten, and Gio Wiederhold. “A Graph-
Oriented Model for Articulation of Ontology Interdependencies”.
In Proceedings of the 7th International Conference on Extending
Database Technology (EDBT 2000), Konstanz, Germany, March
2000. Springer Verlag.

[MLF00] Todd Millstein, Alon Levy, and Marc Friedman. “Query Contain-
ment for Data Integration Systems”. In Proceedings of the ACM
SIGACT-SIGMOD-SIGART Symposium on Principles of Database
Systems (PODS) 2000, Dallas, Texas, May 2000.

[MMS79] David Maier, Alberto O. Mendelzon, and Yehoshua Sagiv. “Test-
ing Implications of Data Dependencies”. ACM Transactions on
Database Systems, 4(4):455–469, 1979.

[MY95] Weiyi Meng and Clement Yu. “Query Processing in Multidatabase
Systems”. In Won Kim, editor, Modern Database Systems: The Ob-
ject Model, Interoperability, and Beyond, pages 551–572. Addison-
Wesley, 1995.

[MZ98] Tova Milo and Sagit Zohar. “Using Schema Matching to Simplify
Heterogeneous Data Translation”. In Proceedings of the 1998 Inter-
national Conference on Very Large Data Bases (VLDB’98), August
1998.

[NBN99] M. Nodine, W. Bohrer, and A. Ngu. “Semantic Brokering over Dy-
namic Heterogeneous Data Sources in InfoSleuth”. In Proceedings
of the 15th IEEE International Conference on Data Engineering
(ICDE’99), 1999.

[Neb89] Bernhard Nebel. “What is Hybrid in Hybrid Representation Sys-
tems?”. In F. Gardin, G. Mauri, and M. G. Filippini, editors,
Proceedings of the International Symposium on Computational In-
telligence’89, pages 217–228, Amsterdam, The Netherlands, 1989.
North-Holland.

[New82] Allen Newell. “The Knowledge Level”. Artificial Intelligence, 18:87–
127, 1982.

[New93] Allen Newell. “Reflections on the Knowledge Level”. Artificial In-
telligence, 59:31–38, 1993.

[NPU98] M. Nodine, P. Perry, and A. Unruh. “Experience with the InfoSleuth
Agent Architecture”. In Proceedings of the AAAI-98 Workshop on
Software Tools for Developing Agents, 1998.

[NU97] M. Nodine and A. Unruh. “Facilitating Open Communication in
Agent Systems: the InfoSleuth Infrastructure”. In Proceedings of
ATAL-97, 1997.

[NvL88] Bernhard Nebel and Kai von Luck. “Hybrid Reasoning in BACK”. In
Z. W. Ras and L. Saitta, editors, Proceedings of the Third Interna-
tional Symposium on Methodologies for Intelligent Systems, pages
260–269, Amsterdam, The Netherlands, 1988. North-Holland.

[Nwa96] Hyacinth S. Nwana. “Software Agents: An Overview”. Knowledge
Engineering Review, 11(3):1–40, September 1996.

[OV99] M. Tamer Özsu and Patrick Valduriez. Principles of Distributed
Database Systems. Prentice Hall, 1999.

[Pap94] Christos H. Papadimitriou. Computational Complexity. Addison-
Wesley, 1994.

[PGMW95] Yannis Papakonstantinou, Hector Garcia-Molina, and Jennifer
Widom. “Object Exchange Across Heterogeneous Information Sys-
tems”. In Proceedings of the 11th IEEE International Conference
on Data Engineering (ICDE’95), March 1995.

[PGR98] C. Petrie, S. Goldmann, and A. Raquet. “Agent-Based Process
Management”. In Proc. of the International Workshop on Intelli-
gent Agents in CSCW, Deutsche Telekom, Dortmund, pages 1–17,
September 1998.

[PHG+ 99] A. Preece, K. Hui, A. Gray, P. Marti, T. Bench-Capon, D. Jones,
and Z. Cui. “The KRAFT Architecture for Knowledge Fusion and
Transformation”. In Proceedings of the Nineteenth SGES Interna-
tional Conference on Knowledge Based Systems and Applied Artifi-
cial Intelligence (ES’99), Cambridge, UK, 1999.

[PL00] Rachel Pottinger and Alon Y. Levy. “A Scalable Algorithm for
Answering Queries Using Views”. In Proceedings of the 26th In-
ternational Conference on Very Large Data Bases (VLDB’2000),
2000.

[PSS93] Peter F. Patel-Schneider and William Swartout. “Description Logic
Knowledge Representation System Specification from the KRSS
Group of the ARPA Knowledge Sharing Effort”, November 1993.

[PV99] Yannis Papakonstantinou and Vasilis Vassalos. “Query Rewrit-
ing for Semistructured Data”. In Proceedings of the 1999 ACM
SIGMOD International Conference on Management of Data (SIG-
MOD’99), 1999.

[PW92] J. S. Penberthy and D. Weld. “UCPOP: A Sound, Complete,
Partial-Order Planner for ADL”. In Third International Conference
on Knowledge Representation and Reasoning (KR-92), Cambridge,
MA, October 1992.

[PWC95] C. Petrie, T. Webster, and M. Cutkosky. “Using Pareto Optimality
to Coordinate Distributed Agents”. AIEDAM, 9:269–281, 1995.

[Qia96] Xiaolei Qian. “Query Folding”. In Proceedings of the 12th IEEE
International Conference on Data Engineering (ICDE’96), pages
48–55, New Orleans, LA, 1996.

[RN95] Stuart Russell and Peter Norvig. Artificial Intelligence - A Modern
Approach. Prentice Hall, NJ, 1995.

[Ros99] Riccardo Rosati. “Towards Expressive KR Systems Integrating Dat-
alog and Description Logics: Preliminary Report”. In Proc. DL’99,
1999.

[RS97] Mary Tork Roth and Peter Schwarz. “Don’t Scrap It, Wrap It!
A Wrapper Architecture for Legacy Data Sources”. In Proceedings
of the 1997 International Conference on Very Large Data Bases
(VLDB’97), 1997.

[RSU95] Anand Rajaraman, Yehoshua Sagiv, and Jeffrey D. Ullman. “An-
swering Queries Using Templates with Binding Patterns”. In Pro-
ceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on
Principles of Database Systems (PODS) 1995, pages 105–112, 1995.

[RVW99] C. M. Rood, D. Van Gucht, and F. I. Wyss. “MD-SQL: A Language
for Meta-data Queries over Relational Databases”. Technical Report
TR528, Dept. of CS, Indiana University, 1999.

[RZA95] Paul Resnick, Richard Zeckhauser, and Chris Avery. “Roles for
Electronic Brokers”. In G.W. Brock, editor, Toward a Competitive
Telecommunication Industry, pages 289–304. Lawrence Erlbaum As-
sociates, Mahwah, NJ, 1995.

[Sar91] Y. Saraiya. “Subtree Elimination Algorithms in Deductive Data-
bases”. PhD thesis, Department of Computer Science, Stanford
University, January 1991.

[SCB+ 98] I.A. Smith, P.R. Cohen, J.M. Bradshaw, M. Greaves, and H. Holm-
back. “Designing Conversation Policies using Joint Intention The-
ory”. In Proc. International Joint Conference on Multi-Agent Sys-
tems (ICMAS-98), Paris, France, July 1998.

[SCH+ 97] Munindar P. Singh, Philip Cannata, Michael N. Huhns, Nigel
Jacobs, Tomasz Ksiezyk, KayLiang Ong, Amit P. Sheth, Chris-
tine Tomlinson, and Darrell Woelk. “The Carnot Heterogeneous
Database Project: Implemented Applications”. Distributed and
Parallel Databases, 5(2):207–225, 1997.

[SDJL96] Divesh Srivastava, Shaul Dar, H. V. Jagadish, and Alon Y. Levy.
“Answering Queries with Aggregation Using Views”. In Proceedings
of the 1996 International Conference on Very Large Data Bases
(VLDB’96), pages 318–329, 1996.

[Sea69] John R. Searle. Speech Acts: An Essay in the Philosophy of Lan-
guage. Cambridge University Press, Cambridge, 1969.

[SGV99] W. Swartout, Y. Gil, and A. Valente. “Representing Capabilities of
Problem Solving Methods”. In Proc. IJCAI-99 Workshop on On-
tologies and Problem-Solving Methods: Lessons Learned and Future
Trends, Stockholm, Sweden, August 1999.

[Shm87] Oded Shmueli. “Decidability and Expressiveness Aspects of
Logic Queries”. In Proceedings of the 6th ACM SIGACT-
SIGMOD-SIGART Symposium on Principles of Database Systems
(PODS’87), pages 237–249, 1987.

[Sho93] Yoav Shoham. “Agent-Oriented Programming”. Artificial Intelli-
gence, 60(1):51–92, 1993.

[SHWK76] Michael Stonebraker, Gerald Held, Eugene Wong, and Peter Kreps.
“The Design and Implementation of INGRES”. ACM Transactions
on Database Systems, 1(3):189–222, 1976.

[Sip97] Michael F. Sipser. Introduction to the Theory of Computation. PWS
Publishing, 1997.

[SL90] Amit P. Sheth and James A. Larson. “Federated Database Sys-
tems for Managing Distributed, Heterogeneous and Autonomous
Databases”. ACM Computing Surveys, 22(3), September 1990.

[SL95] Tuomas Sandholm and Victor Lesser. “Issues in Automated Ne-
gotiation and Electronic Commerce: Extending the Contract Net
Framework”. In 1st International Conference on Multiagent Sys-
tems (ICMAS), pages 328–335, San Francisco, 1995.

[SLK98] Katia Sycara, J. Lu, and Matthias Klusch. “Interoperability among
Heterogeneous Software Agents on the Internet”. Technical report,
Carnegie-Mellon University, Pittsburgh, USA, 1998.

[Smi80] Reid G. Smith. “The Contract Net Protocol: High-Level Com-
munication and Control in a Distributed Problem Solver”. IEEE
Transactions on Computers, 29(12):1104–1113, December 1980.

[SPVG01] K. Sycara, M. Paolucci, M. Van Velsen, and J.A. Giampapa. “The
RETSINA MAS Infrastructure”. Technical Report CMU-RI-TR-
01-05, Robotics Institute, Carnegie Mellon University, March 2001.

[SS89] Manfred Schmidt-Schauss. “Subsumption in KL-ONE is Undecid-
able”. In Proceedings of the 1st International Conference on Prin-
ciples of Knowledge Representation and Reasoning (KR’89), pages
421–431. Morgan Kaufmann, 1989.

[SSS91] Manfred Schmidt-Schauss and Gert Smolka. “Attributive Concept
Descriptions with Complements”. Artificial Intelligence, 48(1):1–
26, 1991.

[SY80] Yehoshua Sagiv and Mihalis Yannakakis. “Equivalences Among
Relational Expressions with the Union and Difference Operators”.
Journal of the ACM, 27(4):633–655, 1980.

[TBM99] P. Tsompanopoulou, L. Bölöni, and D. C. Marinescu. “The De-
sign of Software Agents for a Network of PDE Solvers”. In Proc.
Workshop on Autonomous Agents in Scientific Computing at Au-
tonomous Agents 1999, pages 57–68, 1999.

[TK78] D. Tsichritzis and A. Klug. “The ANSI/X3/SPARC DBMS Frame-
work”. Information Systems, 3(4), 1978.

[TMD92] J. Thierry-Mieg and R. Durbin. “Syntactic Definitions for the
ACeDB Data Base Manager”, 1992.

[TSI94] Odysseas G. Tsatalos, Marvin H. Solomon, and Yannis E. Ioannidis.
“The GMAP: A Versatile Tool for Physical Data Independence”.
In Proceedings of the 1994 International Conference on Very Large
Data Bases (VLDB’94), 1994.

[TYF86] Toby J. Teorey, Dongqing Yang, and James P. Fry. “A Logical
Design Methodology for Relational Databases using the Extended
Entity-Relationship Model”. ACM Computing Surveys, 18(2):197–
222, 1986.

[Ull88] Jeffrey D. Ullman. Principles of Database & Knowledge-Base Sys-
tems Vol. 1. Computer Science Press, December 1988.

[Ull89] Jeffrey D. Ullman. Principles of Database & Knowledge-Base Sys-
tems Vol. 2: The New Technologies. Computer Science Press, 1989.

[Ull97] Jeffrey D. Ullman. “Information Integration Using Logical Views”.
In Proc. ICDT’97, pages 19–40, 1997.

[Var82] Moshe Y. Vardi. “The Complexity of Relational Query Languages”.
In Proc. 14th Annual ACM Symposium on Theory of Computing
(STOC’82), pages 137–146, San Francisco, CA, May 1982.

[Var97] Moshe Y. Vardi. “Why is Modal Logic so Robustly Decidable”. In
DIMACS Series in Discrete Mathematics and Theoretical Computer
Science 31, American Math. Society, pages 149–184, 1997.

[vdM92] Ron van der Meyden. “The Complexity of Querying Indefinite In-
formation about Linearly Ordered Domains”. In Proceedings of the
11th ACM SIGACT-SIGMOD-SIGART Symposium on Principles
of Database Systems (PODS’92), pages 331–345, San Diego, June
1992. ACM Press.

[vLNPS87] K. von Luck, B. Nebel, C. Peltason, and A. Schmiedel. “The
Anatomy of the BACK System”. Technical Report 41, KIT
(Künstliche Intelligenz und Textverstehen), Technical University of
Berlin, January 1987.

[VV98] Sergei Vorobyov and Andrei Voronkov. “Complexity of Nonrecursive
Logic Programs with Complex Values”. In Proceedings of the ACM
SIGACT-SIGMOD-SIGART Symposium on Principles of Database
Systems (PODS) 1998, 1998.

[WBLX00] T. Wagner, B. Benyo, V. Lesser, and P. Xuan. “Investigat-
ing Interactions Between Agent Conversations and Agent Control
Components”. In Frank Dignum and Mark Greaves, editors, Is-
sues in Agent Communication, Lecture Notes in Computer Science.
Springer-Verlag, Berlin, April 2000.

[Wei99] Gerhard Weiss. Multiagent Systems: A Modern Approach to Dis-
tributed Artificial Intelligence. MIT Press, 1999.

[Wel99] Daniel S. Weld. “Recent Advances in AI Planning”. AI Magazine,
20(2):93–123, 1999.

[Wid96] Jennifer Widom. “Integrating Heterogeneous Databases: Lazy or
Eager?”. ACM Computing Surveys, 28A(4), December 1996.

[Wie92] Gio Wiederhold. “Mediators in the Architecture of Future Informa-
tion Systems”. IEEE Computer, 25(3):38–49, March 1992.

[Wie96] Gio Wiederhold, editor. Intelligent Integration of Information.
Kluwer Academic Publishers, Boston, July 1996.

[WJ95] Michael J. Wooldridge and Nicholas R. Jennings. “Intelligent
Agents: Theory and Practice”. Knowledge Engineering Review,
10(2), June 1995.

[Wor01] World Wide Web Consortium. Semantic Web Activity Home Page,
2001. http://www.w3.org/2001/sw/.

[WT98] Gerhard Wickler and Austin Tate. “Capability Representa-
tions for Brokering: A Survey”, November 1998. Available at
http://www.aiai.ed.ac.uk/~oplan/cdl/.

[YL87] H. Z. Yang and Per-Åke Larson. “Query Transformation for PSJ-
Queries”. In Proceedings of the 13th International Conference on
Very Large Data Bases (VLDB’87), pages 245–254, Brighton, Eng-
land, 1987.

[YO79] C. T. Yu and M. Özsoyoglu. “An Algorithm for Tree-Query Mem-
bership of a Distributed Query”. In Proc. IEEE COMPSAC, pages
306–312, 1979.

[Zan96] Carlo Zaniolo. “A Short Overview of LDL++: A Second-Generation
Deductive Database System”. Computational Logic, 3(1):87–93, De-
cember 1996.

[ZHKF95a] Gang Zhou, Richard Hull, Roger King, and Jean-Claude Franchitti.
“Supporting Data Integration and Warehousing Using H2O”. IEEE
Data Engineering Bulletin, 18(2):29–40, June 1995.

[ZHKF95b] Gang Zhou, Richard Hull, Roger King, and Jean-Claude Franchitti.
“Using Object Matching and Materialization to Integrate Heteroge-
neous Databases”. In S. Laufmann, S. Spaccapietra, and T. Yokoi,
editors, Proc. of the 3rd Int. Conf. on Cooperative Information Sys-
tems (CoopIS’95), pages 4–18, Vienna, Austria, May 1995.