
DISSERTATION

**Data Integration against Multiple Evolving Autonomous Schemata**

carried out for the purpose of obtaining the academic degree of Doktor der technischen Wissenschaften under the supervision of

**o. Univ.-Prof. Dr. Robert Trappl**

Institut für medizinische Kybernetik und Artificial Intelligence

Universität Wien

and

**Universitätslektor Dipl.-Ing. Dr. Paolo Petta**

Institut für medizinische Kybernetik und Artificial Intelligence

Universität Wien

**submitted to the Technische Universität Wien**

Fakultät für Technische Naturwissenschaften und Informatik

by

Christoph Koch

E9425227

A-1030 Wien, Beatrixgasse 26/70

Wien, am


Abstract

Research in the area of data integration has resulted in approaches such as federated and multidatabases, mediation, data warehousing, global information systems, and the model management/schema matching approach. Architecturally, approaches can be categorized into those that integrate against a single global schema and those that do not, while on the level of inter-schema constraints, most work can be classified either as so-called global-as-view or as local-as-view integration. These approaches differ widely in their strengths and weaknesses.

Federated databases have been found applicable in environments in which several autonomous information systems coexist – each with their individual schemata – and need to share data. However, this approach does not provide sufficient support for dealing with changing schemata and requirements. Other approaches to data integration, which are centered around a single "global" integration schema, on the other hand, cannot handle the design autonomy of information systems. Under evolution, this type of autonomy eventually leads to schemata between which neither the global-as-view nor the local-as-view approach to source integration can express the inter-schema semantics.

In this thesis, this issue is addressed with a novel approach to data integration which combines techniques from model management, mediation, and local-as-view integration. It allows for the design of inter-schema mappings that are more robust when change occurs. The work has been motivated by the requirements of large scientific collaborations in high-energy physics, as encountered by the author during his stay at CERN.

The approach presented here is based on two foundations. The first is query rewriting with very expressive symmetric inter-schema constraints, called conjunctive inclusion dependencies (cind's). These are containment relationships between conjunctive queries. We address a very general form of the source integration problem, in which several schemata may coexist, each of them containing a number of purely logical as well as a number of source entities. For the source entities, the information system that belongs to the schema holds data, while the logical entities are meant to allow schema entities from other information systems to be integrated against them. The query rewriting problem then aims at rewriting a query over (possibly) both source and logical schema entities of one schema into a query over source entities only, which may be part of any of the schemata known. Under the classical logical semantics, and given a conjunctive input query, we address the problem of finding maximally contained positive rewritings under a set of cind's. Such rewritten queries can then be optimized and efficiently answered using classical distributed database techniques. For the purpose of data integration and for the sake of computability, we require the dependency graph of a set of cind's to be acyclic with respect to inclusion direction.
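For illustration, a cind of this kind can be written as a containment between two conjunctive queries; the relation and variable names below are invented for the example and not taken from the thesis:

```latex
% Hypothetical cind: every PC recorded in an electronics schema also
% appears, with the same identifier and name, as a part in a
% product-data schema.
\{ \langle x, n \rangle \mid \mathit{PC}(x, n, l) \}
  \;\subseteq\;
\{ \langle x, n \rangle \mid \mathit{Part}(x, n, p) \wedge \mathit{Project}(p) \}
```

Both sides are conjunctive queries, so the constraint is symmetric in form: neither schema entity needs to be definable as a view over the other schema alone.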

Regarding the query rewriting problem, we first present semantics and main theoretical properties. Subsequently, we present algorithms and optimizations based on techniques from database theory, which have been implemented in a research prototype. Finally, experimental results obtained with this prototype demonstrate the practical feasibility of our approach.

Reasoning is done exclusively over schemata and queries, and is independent of data volumes, which renders it highly scalable. Apart from that, this flavor of query rewriting has another important strength. The expressiveness of the constraints allows much freedom and flexibility for modeling the peculiarities of a mapping problem. For instance, both global-as-view and local-as-view integration are special cases of the query rewriting problem addressed in this thesis. As will be shown, this flexibility makes it possible to design mappings that are robust with respect to change, as principles such as the decoupling of inter-schema dependencies can be implemented. Furthermore, query rewriting with cind's also makes it possible to deal with concept mismatch in a very wide sense, as each pair of corresponding concepts in two schemata can be modeled as a pair of conjunctive queries.

The second foundation is model management based on cind's as inter-schema constraints. Under the model management approach to data integration, schemata and mappings are treated as first-class citizens in a repository, on which model management operations can be applied. This thesis proposes definitions of schemata and mappings, as well as an array of powerful operations, which are well suited for designing and maintaining mappings between information systems when change is an issue. To complete this work, we propose a methodology for dealing with evolving schemata as well as changing integration requirements.

The combination of the contributions of this thesis brings a practical improvement in openness and flexibility to the federated database and model management approaches to data integration, and provides a first practical integration architecture for large, complex, and evolving computing environments such as those encountered in large scientific collaborations.


Summary

Research in the field of data integration has produced directions such as federated and multidatabases, mediation, data warehousing, global information systems, and model management or schema matching. From an architectural point of view, one can distinguish between approaches that integrate against a single global schema and those that do not. On the level of inter-schema semantics, most previous research can be classified into the so-called global-as-view and local-as-view approaches. These approaches differ strongly in their individual properties.

Federated databases have proven useful in environments in which several information systems need to exchange data with one another, while each of these information systems has its own schema and is autonomous with respect to the design of that schema. In practice, however, this approach does not support the maintenance of changing schemata. Other well-known approaches, which integrate against a "global" schema, in turn do not support the design autonomy of information systems. When schema changes become necessary, this kind of autonomy often leads to schemata against which the desired inter-schema semantics can be expressed neither by global-as-view nor by local-as-view approaches.

This problem is the topic of this dissertation, in which a new approach to data integration is proposed that combines ideas from model management, mediation, and local-as-view integration. Our approach enables the modeling of (partial) mappings between schemata that exhibit an advantageous robustness against change. The motivation for the results presented stems from an extended stay of the author at CERN, during which the goals and needs of large scientific collaborations concerning their information infrastructure were studied.

Our approach rests on two central foundations. The first is query rewriting under very expressive "symmetric" inter-schema dependencies, namely inclusion dependencies between so-called conjunctive queries, which we call conjunctive inclusion dependencies (cind's). We treat a very general form of the source integration problem, in which several schemata may coexist, and each of them may contain both genuine database entities, for which data are available, and purely logical or "virtual" entities, against which dependencies from other schemata can be defined with the help of cind's. The query rewriting problem now aims at rewriting a query, which may be posed over both logical and genuine entities of one schema, into another query that uses only genuine database entities, if necessary from all schemata known to the integration system. More precisely, under the classical logical semantics, a conjunctive query is rewritten with the help of a set of cind's into a maximal logically contained positive query. Queries rewritten in this way can be answered using well-known techniques from the field of distributed databases. For theoretical reasons explained in detail in this dissertation, we restrict ourselves – for data integration – to sets of cind's whose dependency graph is acyclic with respect to the inclusion direction of the cind's.

Regarding the query rewriting problem, we first present semantics and theoretical properties. Then, algorithms and optimizations that build on database techniques are presented, which have been implemented in a prototype. We also provide suitable benchmarks for this prototype, which are meant to show that our approach performs well enough to be of practical relevance.

Our approach scales excellently to large data volumes, since the data integration problem is solved exclusively on the level of schemata and queries, not on the level of data. A further strength is the high expressiveness of our dependencies (cind's), which permits much flexibility in modeling inter-schema relationships; for example, both local-as-view and global-as-view integration are special cases of our approach. As will also be shown, this flexibility makes it possible to create mappings that are robust against change, since it allows cind's to be made largely independent of one another, so that necessary changes usually remain locally confined. Query rewriting with cind's clearly also makes it possible to deal with a very large class of concept mismatches, since pairs of corresponding (to be exact, mutually containing) concepts are expressed by two conjunctive queries put in relation to each other.

The second foundation is model management with cind's. In the model management approach, schemata and mappings are managed as objects with identity, to which a number of powerful maintenance and manipulation operations can be applied. In this dissertation, such operations are defined that are suitable for managing mappings in such a way that frequent changes remain manageable. In addition, a methodology for the management of schema evolution is presented.

The combination of the technical contributions of this dissertation enables a marked improvement of openness and flexibility for the model management and federated database approaches to data integration, and constitutes the first practical solution to the data integration problems encountered in the context of complex, autonomous, and changing information landscapes such as large scientific collaborations.


Acknowledgments

Most of the work on this thesis was carried out during a 30-month stay at CERN, which was sponsored by the Austrian Federal Ministry of Education, Science and Culture under the CERN Austrian Doctoral Student Program.

I would like to thank the two supervisors of my thesis, Robert Trappl of the Department of Medical Cybernetics and Artificial Intelligence of the University of Vienna and Jean-Marie Le Goff of CERN / ETT Division and the University of the West of England, for their continuous support. This thesis would not have been possible without their help.

Paolo Petta of the Austrian Research Institute for Artificial Intelligence took over much of the day-to-day supervision, and I am indebted to him for countless hours of discussions, proofreading of draft papers, and feedback of every kind.

I would like to thank Enrico Franconi of the University of Manchester for provoking my interest in local-as-view integration during his short visit to CERN in early 2000, which has influenced this thesis. I am also indebted to Richard McClatchey and Norbert Toth of the University of the West of England and CERN for valuable comments on parts of an earlier version of this thesis. Any remaining mistakes are, of course, entirely mine.


Contents

1 Introduction 13

1.1 A Brief History of Data Integration . . . . . . . . . . . . . . . . . 13

1.2 The Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

1.3 Use Case: Large Scientific Collaborations . . . . . . . . . . . . . . 18

1.4 Contributions of this Thesis . . . . . . . . . . . . . . . . . . . . . 23

1.5 Relevance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

1.6 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2 Preliminaries 27

2.1 Query Languages . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

2.2 Query Containment . . . . . . . . . . . . . . . . . . . . . . . . . . 31

2.3 Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

2.4 Global Query Optimization . . . . . . . . . . . . . . . . . . . . . 34

2.5 Complex Values and Object Identities . . . . . . . . . . . . . . . . 35

3 Data Integration 39

3.1 Definitions and Overview . . . . . . . . . . . . . . . . . . . . . . . 39

3.2 Federated and Multidatabases . . . . . . . . . . . . . . . . . . . . 41

3.3 Data Warehousing . . . . . . . . . . . . . . . . . . . . . . . . . . 43

3.4 Information Integration in AI . . . . . . . . . . . . . . . . . . . . 44

3.4.1 Integration against Ontologies . . . . . . . . . . . . . . . . 44

3.4.2 Capability Descriptions and Planning . . . . . . . . . . . . 45

3.4.3 Multi-agent Systems . . . . . . . . . . . . . . . . . . . . . 47

3.5 Global-as-view Integration . . . . . . . . . . . . . . . . . . . . . . 50

3.5.1 Mediation . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

3.5.2 Integration by Database Views . . . . . . . . . . . . . . . 51

3.5.3 Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

3.6 Local-as-view Integration . . . . . . . . . . . . . . . . . . . . . . . 53

3.6.1 Answering Queries using Views . . . . . . . . . . . . . . . 54

3.6.2 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 56

3.6.3 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . 60

3.7 Description Logics-based Information Integration . . . . . . . . . 62

3.7.1 Description Logics . . . . . . . . . . . . . . . . . . . . . . 62


3.7.2 Description Logics as a Database Paradigm . . . . . . . . 63

3.7.3 Hybrid Reasoning Systems . . . . . . . . . . . . . . . . . . 65

3.8 The Model Management Approach . . . . . . . . . . . . . . . . . 65

3.9 Discussion of Approaches . . . . . . . . . . . . . . . . . . . . . . . 66

4 Reference Architecture 71

4.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

4.2 Mediating a Query . . . . . . . . . . . . . . . . . . . . . . . . . . 73

4.3 Research Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

5 Query Rewriting 75

5.1 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

5.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

5.3 Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

5.3.1 The Classical Semantics . . . . . . . . . . . . . . . . . . . 78

5.3.2 The Rewrite Systems Semantics . . . . . . . . . . . . . . . 82

5.3.3 Equivalence of the two Semantics . . . . . . . . . . . . . . 84

5.3.4 Computability . . . . . . . . . . . . . . . . . . . . . . . . . 88

5.3.5 Complexity of the Acyclic Case . . . . . . . . . . . . . . . 90

5.4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

5.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

5.5.1 Chain Queries . . . . . . . . . . . . . . . . . . . . . . . . . 94

5.5.2 Random Queries . . . . . . . . . . . . . . . . . . . . . . . 97

5.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

6 Model Management 99

6.1 Model Management Repositories . . . . . . . . . . . . . . . . . . . 99

6.2 Managing Change . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

6.2.1 Decoupling Mappings . . . . . . . . . . . . . . . . . . . . . 103

6.2.2 Merging Schemata . . . . . . . . . . . . . . . . . . . . . . 107

6.3 Managing the Acyclicity of Constraints . . . . . . . . . . . . . . . 108

7 Outlook 111

7.1 Physical Data Independence . . . . . . . . . . . . . . . . . . . . . 113

7.1.1 The Classical Problem . . . . . . . . . . . . . . . . . . . . 113

7.1.2 Versions of Logical Schemata . . . . . . . . . . . . . . . . 117

7.2 Rewriting Recursive Queries . . . . . . . . . . . . . . . . . . . . . 122

8 Conclusions 127

List of Figures

1.1 Mappings in LAV (left) and GAV (right). . . . . . . . . . . . . . 15

1.2 The space of objects that can be shared using symmetric mappings given true concept mismatch between entities of source and integration schemata. . . . . . . . . . . 17

1.3 Data flow between information systems that manage the steps of an experiment's lifecycle. . . . . . . . . . . 20

1.4 ER diagrams for Example 1.3.1: Electronics database (left) and product-data management system (right). . . . . . . . . . . 21

1.5 Concept mismatch between PCs of the electronics database and parts of the product-data management system of "Project1". . . . 22

1.6 Architecture of the information infrastructure . . . . . . . . . . 24

3.1 Artist's impression of source integration. . . . . . . . . . . . . 40

3.2 Federated 5-layer schema architecture . . . . . . . . . . . . . . 42

3.3 Data warehousing architecture and process. . . . . . . . . . . . 43

3.4 MAS architectures for the intelligent integration of information. Arrows between agents depict exemplary communication flows. Numbers denote logical time stamps of communication flows. . . . 48

3.5 A mediator architecture . . . . . . . . . . . . . . . . . . . . . 51

3.6 MiniCon descriptions of the query and views of Example 3.6.1. . . 58

3.7 Comparison of global-as-view and local-as-view integration. . . . 67

3.8 Comparison of Data Integration Architectures. . . . . . . . . . . 68

4.1 Reference Architecture . . . . . . . . . . . . . . . . . . . . . 72

5.1 Hypertile of size i ≥ 2 (left) and the nine possible overlapping hypertiles of size i − 1 (right). . . . . . . . . . . 91

5.2 Experiments with chain queries and nonlayered chain cind's. . . . 95

5.3 Experiments with chain queries and two layers of chain cind's. . . 96

5.4 Experiments with chain queries and five layers of chain cind's. . . 96

5.5 Experiment with random queries. . . . . . . . . . . . . . . . . . 97

6.1 Operations on schemata. . . . . . . . . . . . . . . . . . . . . . 100

6.2 Operations on mappings. . . . . . . . . . . . . . . . . . . . . . 100

6.3 Complex model management operations. . . . . . . . . . . . . . . 101

6.4 Data integration infrastructure of Example 6.2.1. Schemata are visualized as circles and elementary mappings as arrows. . . . . 104

6.5 The lifecycle of the mappings of a legacy integration schema. . . 106

6.6 Merging auxiliary integration schemata to improve maintenance. . 107

6.7 A clustered auxiliary schema. Schemata are displayed as circles and mappings as arrows. . . . . . . . . . . 108

7.1 A cind as an inter-schema constraint (A) compared to a data transformation procedure (B). Horizontal lines depict schemata and small circles depict schema entities. Mappings are shown as thin arrows. . . . . . . . . . . 112

7.2 EER diagram of the university domain (initial version). . . . . . 114

7.3 EER diagram of the university domain (second version). . . . . . 118

7.4 Fixpoint of the bottom-up derivation of Example 7.2.1. . . . . . 123

Chapter 1

Introduction

The integration of heterogeneous databases and information systems is an area of high practical importance. The very success of information systems and data management technology in a short period of time has caused the virtual omnipresence of stand-alone systems that manage data – "islands of information" – that by now have grown too valuable not to be shared. However, this sharing, and with it the resolution of heterogeneity between systems, entails interesting and nontrivial problems, which have received much research interest in recent years. Ongoing research activity, however, is evidence that many questions remain unanswered.

1.1 A Brief History of Data Integration
Given a number of heterogeneous information systems, in practice it is not always desirable or even possible to completely reengineer and reimplement them to create one homogeneous information system with a single schema (schema integration [BLN86, JLVV00]). Instead, it is often necessary to perform data integration [JLVV00], where the schemata of heterogeneous information systems are left unchanged and integration is carried out by transforming queries or data. To realize such transformations, some flavor of mappings (either procedural code or declarative inter-schema constraints) between information systems is required. If the data integration reasoning is carried out entirely on the level of queries and schema-level descriptions, this is usually called query rewriting, while the term data transformation refers to the heterogeneous data themselves being classified, transformed, and fused to appear homogeneous under some integration schema.

Most previous work on data integration can be classified into two major directions by the method by which the inter-schema mappings used for integration are expressed (see e.g. [FLM98, Ull97]). These are called local-as-view (LAV) [LMSS95, YL87, LRO96, GKD97, AK92, TSI94, CKPS95] and global-as-view (GAV) [GMPQ+97, ACPS96, CHS+95, FRV95] integration.


The more traditional paradigm is global-as-view integration, where mappings – often called mediators after [Wie92] – are defined as follows. Mediators implement virtual entities (concepts, relations, or classes, depending on the nomenclature and data model used) exported by their interfaces as views over the heterogeneous sources, specifying how to combine their data to resolve some (or all) of the experienced heterogeneity. Such mediators can be (generalizations of) simple database views (e.g. CREATE VIEW constructs in SQL) or can be implemented by some procedural code. Global-as-view integration has been used in multidatabases [SL90] and data warehousing [JLVV00], and recently for the integration of multimedia sources [ACPS96, CHS+95] and as a fertile testbed for semistructured data models and technologies [GMPQ+97].

In the local-as-view paradigm, inter-schema constraints are defined in strictly the opposite way [1]. Queries over a purely logical "global" mediated schema are answered by treating sources as if they were materialized views over the mediated schema, where only these materialized views may be used to answer the query – after all, the mediated schema does not directly represent any data. Query answering then reduces to the so-called problem of answering queries using views, which has been intensively studied by the database community [LMSS95, DGL00, AD98, BLR97, RSU95] and is related to the query containment problem [CM77, CV92, Shm87, CDL98a]. Local-as-view integration has not only been applied to, and shown to be well suited for, data integration in global information systems [LRO96, GKD97, AK92], but also to related applications beyond data integration, such as query optimization [CKPS95] and the maintenance of physical data independence [TSI94].
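The LAV direction of the mapping can be illustrated as follows, with invented relation names rather than the thesis's notation: a source is described as a view over the mediated schema, and a query over the mediated schema must be rewritten so that it uses only such views.

```latex
% Mediated schema: Emp(e), WorksOn(e, p). Source S_1 is described
% as a (possibly incomplete) view over the mediated schema:
S_1(e, p) \;\subseteq\; \{ \langle e, p \rangle \mid
    \mathit{Emp}(e) \wedge \mathit{WorksOn}(e, p) \}
% A query over the mediated schema ...
Q(e) \;\leftarrow\; \mathit{WorksOn}(e, p)
% ... has a contained rewriting that uses only the source:
Q'(e) \;\leftarrow\; S_1(e, p)
```

Here $Q'$ is contained in $Q$ because every $S_1$-tuple is guaranteed to satisfy the view definition; in general such a rewriting is contained in, but not equivalent to, the original query, since the sources may be incomplete.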

An important distinction is to be made between data integration architectures that are centered around a single "global" integration schema against which all sources are integrated (this is the case, for instance, for data warehouses and global information systems, and is intrinsic to the local-as-view approach) and others that are not, such as federated and multidatabases. The lack of a single global integration schema in the data integration architecture has a problematic consequence: each source may need to be mapped against each of the integration schemata, leading to a large number of mappings that need to be created and managed. In architectures such as those of federated database systems, where each component database may be a source and a consumer of integrated data at once, a quadratic number of mappings may be required.

The globality of integration schemata is usually judged by their role in an integration architecture. Global schemata are singletons that occupy a very central role in the architecture, and are unique, consistent, and homogeneous world views against which all other schemata in the system (usually considered the

[1] At first sight, this may appear unintuitive, but is not. For instance, the local-as-view approach can be motivated by AI planning for information gathering using content descriptions of sources in terms of a global world model (as "planning operators") [AK92, KW96].


Figure 1.1: Mappings in LAV (left) and GAV (right).

"sources") are to be integrated. There is globality in integration schemata on a different level as well. We want to consider integration schemata as designed at will while taking a global perspective if

• they are artifacts specifically created for the resolution of some heterogeneity, and

• the entirety of sources in the system that have any relevance to the heterogeneity problems addressed has been taken into account in the design process.

Thus, in such "global" schemata, a global perspective has been taken when designing them. However, they do not have to be monolithic homogeneous world views. This qualifies the collection of logical entities exported by the mediators in a global-as-view integration system as a specifically designed global integration schema, although such a schema is not necessarily homogeneous.

An important characteristic of data integration approaches is how well concept mismatch occurring between source and integration schemata can be bridged. We have pointed out that both GAV and LAV use a flavor of views for the mapping between sources and integration schemata. In Figure 1.1, we compare the local-as-view and global-as-view paradigms by visualizing (as Venn diagrams) the spaces of tuples (in relational queries) or objects that can be expressed by queries over source and integration schemata.

Views as inter-schema constraints are strongly asymmetric. A single atomic schema entity appearing in a schema on one side of the invisible conceptual border line between integration and source schemata is always defined by a query or (as the general idea of mediation permits) by some procedural code which computes the entity's extent over the schemata on the other side of that border line. As a consequence, both LAV and GAV are restricted in how well they can deal with concept mismatch [2].

This restriction remains theoretical only because, in both LAV and GAV, it is always implicitly assumed that sources are integrated against integration schemata that have been freely designed, with no constraints imposed other than the current integration requirements [3]. However, when data need to be integrated against the schemata of information systems that have design autonomy, or when integration schemata carry a legacy [4] burden that an integration approach has to be able to deal with, both LAV and GAV fail.

Note that views are not the only imaginable way of mapping schemata in

data integration architectures. For mappings that are not expressible as views,

it may be possible to relate the spaces of objects expressible by complex logical

expressions – say queries – over the concepts of the schemata (see Figure 1.2).
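For conjunctive queries, such relationships between the spaces of objects defined by queries are decidable: containment between two conjunctive queries can be tested by searching for a homomorphism between them (the classic Chandra–Merlin test). The following sketch illustrates the idea on two invented toy queries; for brevity, all terms are assumed to be variables, and the brute-force search is exponential, as the problem is NP-complete.

```python
from itertools import product

# A conjunctive query is a pair (head variables, body atoms).
# Containment q1 ⊆ q2 holds iff there is a homomorphism from q2 into q1.
def contained_in(q1, q2):
    head1, body1 = q1
    head2, body2 = q2
    vars2 = sorted({v for _, args in body2 for v in args} | set(head2))
    terms1 = sorted({t for _, args in body1 for t in args} | set(head1))
    atoms1 = {(p, tuple(args)) for p, args in body1}
    # Try every mapping of q2's variables into q1's terms.
    for image in product(terms1, repeat=len(vars2)):
        h = dict(zip(vars2, image))
        if tuple(h[v] for v in head2) != tuple(head1):
            continue  # the head must be preserved
        if all((p, tuple(h[a] for a in args)) in atoms1 for p, args in body2):
            return True  # every atom of q2 maps into an atom of q1
    return False

# q1(x) :- r(x, y), r(y, z)  asks for nodes with an outgoing path of length 2;
# q2(x) :- r(x, y)           asks for nodes with an outgoing edge.
q1 = (("x",), [("r", ("x", "y")), ("r", ("y", "z"))])
q2 = (("x",), [("r", ("x", "y"))])
print(contained_in(q1, q2))  # True:  every length-2 path starts with an edge
print(contained_in(q2, q1))  # False: an edge need not extend to length 2
```
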

“Legacy” integration schemata are faced when

• there is no central design authority providing “global” schemata,

• future integration requirements or changes to schemata of information systems cannot be appropriately predicted,

• existing integration schemata cannot be amended when integration requirements or the nature of sources to be made available change in an unforeseen way, or

• the creation of “global” schemata is infeasible because of the size and complexity of the problem domain and modeling task⁵ [MKW00].

Recent work in the area has resulted in two new approaches that do not center around a single “global” integration schema and where inter-schema constraints do not necessarily have the strictly asymmetric syntax encountered in LAV and GAV. The first uses expressive description logics systems with symmetric constraints for data integration [CDL98a, CDL+98b, Bor95]. Constraints can be

² See Example 1.3.1 and [Ull97].

³ This makes the option of a change of requirements or of the nature of sources after the design of the integration schemata has been finished hover over such architectures like the sword of Damocles.

⁴ We do not refer to the legacy systems issue here, though. In principle, legacy systems are operational systems that in some aspect of their design differ from what they ideally should be like; they use at least one technology that is no longer part of the current overall strategy in some enterprise or collaborative environment [AS99]. In practice, information systems are usually referred to as legacy in the context of data integration if they are not even based on a modern data management technology, usually making it necessary to treat them monolithically and “wrap” them [GMPQ+97, RS97] by software that makes them appear to respond to data requests under a state-of-the-art data management paradigm.

⁵ This may make the Semantic Web effort of the World Wide Web Consortium [Wor01] seem to be threatened by another very sharp blade hanging by an amazingly fragile thread.


[Figure 1.2 depicts, as a Venn diagram, the space of tuples expressible as queries over the sources, the space of tuples expressible as queries over the global schema, and, as their overlap, the space of tuples that can be made available to queries over the integrated schema by mappings from sources.]

Figure 1.2: The space of objects that can be shared using symmetric mappings given true concept mismatch between entities of source and integration schemata.

defined as containment relationships between complex concepts that represent

(path) queries. The main drawback is that integration has to be carried out as

ABox reasoning [CDL99], i.e. the classification of data in a (hybrid) description

logics system [Neb89]. This does not scale well to large data volumes. Further-

more, such an approach is not applicable when sources have restricted interfaces

(as is often the case on the Web) and it is not possible to import all data of a

source into the reasoning system.

The second approach, model management [BLP00, MHH+ 01], treats schemata

and mappings between schemata as first-class objects that can be stored in a

repository and manipulated with cleanly defined model management operations.

This direction is still in an early stage and no convergence toward clean, widely usable semantics has occurred yet. Mappings are often defined as lines between

concepts (e.g. relations or classes in schemata) using an array of semantics that

are often not very expressive. While such approaches allow for neat graphical

visualization and the editing of mappings, they do not provide the mechanisms

and expressive semantics to support design and modeling actions to make evolving

schemata manageable.

1.2 The Problem

The problem addressed in this thesis is the following. We aim at an approach to

data integration that satisfies three requirements.

• Individual information systems may have design autonomy for their schemata. In general, no global schemata can be built. Each individual schema may have been defined before integration requirements were completely known, and be ill-suited for a particular integration task.


• Individual schemata may evolve independently. Even the best-designed integration schemata may end up with concept mismatch that cannot be dealt with through view-based mappings.

• The approach must scale. The data integration problem has to be solved entirely on the level of queries and descriptions of information systems (i.e., by query rewriting) rather than on the level of reasoning over the data, to ensure the independence of the approach from the amount of data managed.

Since the number of mappings in data integration architectures with autonomous component systems may be quadratic in the number of schemata and thus very large, and since schemata and integration requirements may change, a way of managing schemata and mappings is needed that is simple and in which many tasks can be automated. This requires support for managing mappings and their change, and for reusing mappings both actively, in the actions performed for managing schemata and mappings, and passively, through the transitivity of their semantics⁶.

The work presented in this thesis has been carried out in the context of a very

large international scientific collaboration in the area of high-energy physics. We

will have a closer look at the problem of providing interoperability of information

systems in that domain in Section 1.3.

1.3 Use Case: Large Scientific Collaborations

Large scientific collaborations are becoming more and more common because cutting-edge scientific research in areas such as high-energy physics, the human genome, or aerospace has become extremely expensive. Data

integration is an issue since many of the individual information systems being

operated in such an environment require integrated data to be provided from

other information systems in order to work. As we will point out in this section,

the main sources of difficulty related to source integration in the information

infrastructures of such collaborations are the design autonomy of information

systems, change of requirements and evolution of schemata, and large data sets.

A number of issues stand in the way of building a single unified “global”

logical schema (as they exist for data warehouses or global information systems)

for a large science project. We will summarize them next.

Heterogeneity. Heterogeneity is pervasive in large scientific research collabora-

tions, as there are existing legacy systems as well as largely autonomous groups

that build more such legacy systems.

⁶ That is, given a mapping from schema A to schema B and a mapping from schema B to schema C, we assume that we automatically arrive at a mapping from schema A to schema C.


Scientific collaborations consist of a number⁷ of largely autonomous institutes that independently develop and maintain their individual information systems⁸.

This lack of central control fosters creativity and is necessary for political and

organizational reasons. However, it leads to problems when it comes to mak-

ing information systems interoperate. In such a setting, heterogeneity arises due

to many reasons. Firstly, no two designers would conceptualize a given prob-

lem situation in the same way. Furthermore, distinct groups of researchers have

fundamentally different ways of dealing with bodies of knowledge, due to differ-

ent (human) languages, professional background, community or project jargon⁹, teacher and curriculum, or “school of thought”. Several subcommunities independently develop and use similar but distinct software for the same tasks. As a consequence, one can assume similar but slightly different schemata¹⁰. In an environment such as the Large Hadron Collider (LHC) project at CERN [LHC] and

huge experiments such as CMS [CMS95] currently under preparation, potentially

hundreds of individual information systems will be involved with the project dur-

ing its lifetime, some of them commercial products, others homegrown efforts of

possibly several hundred person years. This is the case because even for the same

task, sub-collaborations or individual institutes working on different subprojects

independently build systems.

When it comes to types of heterogeneity that may be encountered in such an

environment, it has to be remarked that beyond heterogeneity due to discrepan-

cies in conceptualizations of human designers (including polysemy, terminological

overlap and misalignment), there is also heterogeneity that is intrinsic to the do-

main. For example, in the environment of high-energy physics experiments (say,

a particle detector), detector parts will be necessarily conceptualized differently

depending on the kind of information system in which they are represented. For

instance, in a CAD system that is used for designing the particle detector, parts

will be spatial structures; in a construction management system, they will have

to be represented as tree-like structures modeling compositions of parts and their

sub-parts, and in simulation and experimental data taking, parts have to be

aggregated by associated sensors (readout channels), with respect to which an

experiment becomes a topological structure largely distinct from the one of the

design drawing. We believe that such differences also lead to different views on

the knowledge level, and certainly lead to different database schemata.

Hardness of Modeling. Apart from the notion of intrinsic heterogeneity introduced in the previous paragraph, there are a number of other issues that contribute to the hardness of modeling in a scientific domain. Firstly,

⁷ In large collaborations, they may amount to hundreds.

⁸ The requirements presented here closely relate to classifications of component autonomy in federated databases [HM85].

⁹ Such jargon may have developed over time in previous projects on which a group of people worked together.

¹⁰ Unfortunately, it is often trickier to deal with subtle mismatch than with great mismatch.


[Figure 1.3 depicts data flows between information systems for the lifecycle stages Design, Simulation, Construction, Precalibration & Testing, Detector Control, Calibration, Maintenance, Event Simulation, Reconstruction, and Decommissioning, together with Human Resources and Finance systems.]

Figure 1.3: Data flow between information systems that manage the steps of an experiment’s lifecycle.

overall agreement on a conceptualization of a large real-world domain cannot be achieved. Whenever new requirements are discovered or a better understanding of a domain is achieved, there will be an incentive to change the current schema. Such change may go beyond pure extension. Instead, existing parts of schemata will have to be revisited, invalidating mappings for data integration that rely on these schemata. Global modeling also fails because of the sheer size of such a scientific domain. In fact, in a project that involves the collaboration of several thousand researchers and engineers, modeling the domain would require access to all the knowledge in the heads of all the people involved, and that knowledge would have to be stable. This, however, is an unrealistic assumption, all the more so in an experimental research environment.

The Project Lifecycle. It is important to note that large science projects have a lifecycle much like industrial projects; that is, they go through stages such as design, simulation, construction, testing, calibration, deployment, decommissioning, and many more¹¹. Such steps have some temporal overlap in practice, but there is a gross ordering. Large science projects persist over long time spans¹². As a consequence, the information systems for some steps of the lifecycle will not be

¹¹ See Figure 1.3 for an example of data flows that may need to occur between (heterogeneous) information systems for the various activities in the lifecycle, all requiring data integration.

¹² For example, the LHC project is expected to be carried on for at least 15 years.


[Figure 1.4 depicts two ER diagrams: on the left, entities pc, cpu, and location (with attributes id and name) connected by the relationships pc_cpu and pc_location; on the right, the entity part with the self-relationship part_of, related to location (id, name) through part_location.]

Figure 1.4: ER diagrams for Example 1.3.1: Electronics database (left) and product-data management system (right).

built until other information systems have already been in existence for years.

In such an experimental setting, full understanding of the requirements for subsequent information systems can often only be achieved once the information systems for the current work have been implemented. Nevertheless, since some

information systems are already in need of data integration, one either has to

build a global logical schema today which might become invalid later, leading

to serious maintenance problems of the information infrastructure (that is, the

logical views that map sources), or an approach has to be followed that goes

without such a schema. Since it is impossible to preview all the requirements

of a complex system far into the future, one cannot avoid the need for change

through proper a priori design.

Concept Mismatch. It is clear from the above observations that concept mis-

match between schemata relevant to data integration may occur in the domain

of high energy physics research.

Example 1.3.1 Assume there are two information systems, the first of which is a database holding data on electronics components¹³ of an experiment under construction, with the relational schema

R1 = {pc_cpu(Pc, Cpu), pc_location(Pc, LocId), location(LocId, LocName)}

The database represents information about PCs and their CPUs as well as the location where these parts currently are to be found. Locations have a name

¹³ To make the example more easily accessible, we speak of personal computers as the sole electronics parts represented. Of course, personal computers are not representative building blocks of high-energy physics experiments.


[Figure 1.5 depicts two overlapping sets, the PCs of the destination schema and the parts of “Project1” of the source schema, whose intersection is the PCs of “Project1”.]

Figure 1.5: Concept mismatch between PCs of the electronics database and parts of the product-data management system of “Project1”.

and an identifier. The second system is a product data management system for a subproject “Project1” with the schema

R2 = {part_of(Part1, Part2), part_location(Part, LocId), location(LocId, LocName)}

(see also Figure 1.4). The second database schema represents an assembly tree of “Project1” by the relation “part_of” and, again, the locations of parts.

Let us now assume that the first information system (the electronics database) holds data that should be shared with the second. While the names of the locations are the same in both information systems, the domains of the location ids must be assumed to be distinct and cannot be shared.

We thus experience two kinds of complications with this integration problem. The distinct key domains for locations in the two information systems entail that a correspondence must be established between (derived) concepts of the two schemata that are both defined by queries¹⁴. Furthermore, we observe concept mismatch. The first schema contains only electronics parts, but may do so for other projects besides “Project1” as well, while in the second schema only parts of “Project1” are to be represented, but those parts are not restricted to electronics parts (Figure 1.5).

As a third complication in this example, we assume some granularity mis-

match. Assume that the second information system is to hold a more detailed

model of “Project1” than the first and shall represent CPUs as parts of main-

boards of PCs and those in turn as parts of PCs, rather than just CPUs as parts

of PCs. Of course, we have no information on mainboards in the electronics

database, but this information could be obtained from another source.

¹⁴ Thus, this correspondence could be expressed neither in GAV nor in LAV.


We could encode this by the following semantic constraint, expressing a mapping between schemata by a containment relationship between two queries:

{⟨Pc, Cpu, LocName⟩ | ∃Mb, LocId : R2.part_of(Mb, Pc) ∧ R2.part_of(Cpu, Mb) ∧ R2.location(LocId, LocName) ∧ R2.part_location(Pc, LocId)}
⊇
{⟨Pc, Cpu, LocName⟩ | ∃LocId : R1.pc_cpu(Pc, Cpu) ∧ R1.belongs_to(Pc, “Project1”) ∧ R1.location(LocId, LocName) ∧ R1.pc_location(Pc, LocId)}

Informally, one may read this constraint as follows:

PCs together with their CPUs and locations which are marked as belonging to “Project1” in the first information system should be part of the answers to queries over parts and their locations in the second information system, where CPUs should be known as parts two levels below PCs in the assembly hierarchy represented by the part_of relation.

We do not provide any formal semantics of such constraints for data integration at this point, but rely on the intuition that such a containment constraint between two queries expresses the desired inter-schema dependency and allows one, given appropriate reasoning algorithms (if they exist), to perform data integration in the presence of concept mismatch in a wide sense.
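To make the constraint concrete, the sketch below evaluates the subsumed (right-hand) query of Example 1.3.1 by nested loops over a small instance of R1; by the containment constraint, every tuple it produces must also be an answer to the subsuming query over R2. All relation contents are invented for illustration.

```python
# Toy instance of R1 (all data values are assumptions made up for this sketch).
pc_cpu = {("pc1", "cpuA"), ("pc2", "cpuB")}
belongs_to = {("pc1", "Project1"), ("pc2", "Project2")}
location = {("l1", "Building40"), ("l2", "Building513")}
pc_location = {("pc1", "l1"), ("pc2", "l2")}

def subsumed_query():
    """Enumerate <Pc, Cpu, LocName> for the PCs of 'Project1' (the RHS query)."""
    answers = set()
    for pc, cpu in pc_cpu:
        if (pc, "Project1") not in belongs_to:
            continue  # the query restricts to parts of "Project1"
        for pc2, locid in pc_location:
            if pc2 != pc:
                continue
            for lid, lname in location:
                if lid == locid:
                    answers.add((pc, cpu, lname))
    return answers

print(subsumed_query())  # {('pc1', 'cpuA', 'Building40')}
```

Note that pc2 of the toy instance is filtered out: it belongs to “Project2” and therefore does not fall under the constraint.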

Large Data Sets. Scientific computing has always been known for manipulating very large amounts of data. Data volumes in information systems related to the

construction of LHC experiments are expected to be in the Terabyte range, and

experimental data collected during the lifetime of LHC will amount to dozens of

Petabytes. For scalability reasons, information integration has to be carried out

on the level of queries (query rewriting) rather than data (data transformation).

1.4 Contributions of this Thesis

This thesis is, to the best of our knowledge, the first to actually address the problem of data integration with multiple unsophisticated, evolving, autonomous integration schemata. Each such schema may consist of both source relations that

hold data and logical relations that do not. Schemata may be designed without

taking other schemata or data integration considerations into account. Each

query over a schema is rewritten into a query exclusively over source relations of

information systems in the environment, using a number of schema mappings.

We propose an approach to data integration (see Figure 1.6) based on model

management and query rewriting with expressive constraints within a federated

architecture. Our flavor of query rewriting is based on constraints with clean,


[Figure 1.6 depicts a repository of schemata and mappings, with an editor and proxies for relational schemata, on one side, and the information systems, each behind a mediator with a query facility, on the other; queries pass through query rewriting, physical plan generation, and query plan execution, with schema and data translation between the two sides.]

Figure 1.6: Architecture of the information infrastructure.

expressive semantics. It allows for mappings between schemata that are generalizations of both the LAV and GAV paradigms.

Regarding query rewriting, we first provide characterizations of two different semantics for query rewriting with symmetric constraints: a classical logical one and one motivated by rewrite systems [DJ90]. The rewrite systems semantics

is based on the intuitions of local-as-view rewriting and generalizes from them.

We formally outline both semantics as well as algorithms for both which, given a conjunctive query, enumerate the maximally contained rewritings¹⁵. We discuss

various relevant aspects of query rewriting in our context, such as minimality and

nonredundancy of conjunctive queries in the rewritings. Next we compare the two semantics and argue that the second is more intuitive and may better fit the expectations of human users of data integration systems than the first. Following

the philosophy of that semantics, rewritings can be computed by making use of

database techniques such as query optimization and ideas from e.g. algorithms

developed for the problem of answering queries using views. We believe that in

a practical information integration context there are certain regularities (such as

sets of predicates – schemata – from which predicates are used together in queries,

while there are few queries that combine predicates from several schemata) that

render this approach more efficient in practice. Surprisingly, however, it can be

shown that the two semantics coincide. We then present a scalable algorithm for

the rewrite systems semantics (based on previous work such as [PL00]), which

we have implemented in a practical system16 , CindRew . We evaluate it experi-

mentally against other algorithms for the same problem. It turns out that our

implementation, which we make available for download, scales to thousands of

¹⁵ The notion of maximally contained rewritings is the one that usually best describes the intuitive idea of the “best rewritings possible” in a data integration context.

¹⁶ This system can be checked out at http://home.cern.ch/∼chkoch/cindrew/


constraints and realistic applications. We conclude with a discussion of how our

query rewriting approach fits into state-of-the-art data integration and model

management systems.

Regarding model management, we present definitions of data models, sche-

mata, mappings, and a set of expressive model management operations for the

management of schemata in a data integration setting. We argue that our ap-

proach can overcome the problems related to “unsophisticated” legacy integration

schemata, and provide a sketch of a methodology for managing evolving map-

pings.

1.5 Relevance

As we discuss a framework for data integration that is based on very weak assumptions, this thesis is relevant to a large number of applications in which other approaches eventually fail. These include networks of autonomous virtual

enterprises having different deployment lifecycles or standards for their informa-

tion systems, the information infrastructure of large international collaborations

(e.g., in science), and large enterprises that face the integration of several exist-

ing heterogeneous data warehouses after mergers or acquisitions or major change

of business model. More generally, our work is applicable in any environment in which anything less than full commitment exists towards far-ranging reengineering that would bring all the information systems in that environment under a single common enterprise model. Obviously, our work

may also allow federated databases [HM85, SL90] to deal more successfully with

schema evolution.

Let us reconsider the point of design autonomy for schemata of information

systems in the case of companies and e-commerce. For many good reasons, com-

panies nowadays want to have their information systems interoperate; however,

there is no sufficiently strong trend towards agreeing on schemata. While there

is clearly much work done towards standardization, large players in IT have an

incentive to propose competing “standards” and bodies of meta-data. Asking

for common schemata beyond enterprise boundaries today is hardly realistic.

Instead, even the integration of the information systems inside a single large enterprise is a problem almost too hard to solve¹⁷, and motivates some independence of the information infrastructure of horizontal or vertical business units,

again leading to the legacy integration schema problem that we want to address

here. That said, the work in this thesis is highly relevant to business-to-business e-commerce and the management of the extended supply chain and

¹⁷ This of course excludes the issue of data warehouses, which, although they have a global scope w.r.t. the enterprise, address only a small part of the company data (in terms of schema complexity, not volume) – such as sales information – that are usually well understood and where requirements are not expected to change much in the future.


virtual enterprises.

Data warehouses that have been the results of large and very expensive de-

sign and reengineering efforts customized to a specific enterprise really are legacy

systems from the day when their design phase ends. Similarly, when companies

merge, the schemata of those data warehouses that the former entities created

are again bound to feature a substantial degree of heterogeneity. This can be ap-

proached in two ways, either by considering these schemata legacy or by creating

a new, truly global information system (almost) from scratch.

1.6 Overview

The remainder of this thesis is structured as follows. In Chapter 2, some pre-

liminary notions from database theory, computability theory, and complexity

theory are presented. Chapter 3 discusses previous work on data integration.

We start with definitions in Section 3.1 and consecutively discuss federated and

multidatabases, data warehousing, mediator systems, information integration in

AI, global-as-view and local-as-view integration (the latter is presented at some

length, since its theory will be highly relevant to our work of Chapter 5), the

description logics-based and model management approaches to data integration,

and finally, in Section 3.9, we discuss the various approaches by maintainabil-

ity and other aspects. In Chapter 4, we present our reference architecture for

data integration and discuss its building blocks, which will be treated in more

detail in consecutive chapters. Chapter 5 presents our approach to query rewrit-

ing with expressive symmetric constraints. Chapter 6 first discusses our flavor

of schemata, mappings and model management operations, and then provides

some thoughts on how to guide the modeling process for mappings such that the

integration infrastructure can be managed as easily as possible. We discuss some

advanced issues of query rewriting, notably extensions of query languages such

as recursion and sources with binding patterns in Chapter 7. We also discuss an-

other application of our work on query rewriting with symmetric constraints, the

maintenance of physical data independence under schema evolution. Chapter 8

concludes with a final discussion of the practical implications of this thesis.

Chapter 2

Preliminaries

This chapter discusses some preliminaries which mainly stem from database theory and which will be needed in later chapters. It is beyond the scope of this

thesis to give a detailed account of computability theory and complexity theory.

We refer to [HU79, Sip97, GJ79, Joh90, Pap94, DEGV] for introductory texts in

these areas. We also assume a basic understanding of databases, schemata, and

query languages, and notably SQL (for an introductory work on this see [Ull88]).

Finally, we presume a basic understanding of mathematical logic and automated theorem proving, including concepts such as resolution and refutation, and notions such as predicates, atoms, terms, Skolem functions, Horn clauses, and unit clauses, which are used in the standard way (see e.g. [RN95, Pap94]).

We define the following access functions for later use: Given a Horn clause c, Head(c) returns c’s head atom and Body(c) returns the ordered list of its body atoms. Bodyᵢ(c) returns the i-th body atom. Pred(a) returns the predicate name of atom a, while Preds(Body(c)) returns the predicate names of the atoms in the body of clause c. Vars(a) returns the set of variables appearing in atom a and Vars(Body(c)) returns the variables in the body of the clause c.
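These access functions can be rendered executably as follows; the concrete representation (named tuples, 1-based body indexing) is our own illustrative choice, and all atom arguments are assumed to be variables.

```python
from collections import namedtuple

# A minimal Horn-clause representation mirroring the access functions above.
Atom = namedtuple("Atom", ["pred", "args"])
Clause = namedtuple("Clause", ["head", "body"])

def Head(c): return c.head
def Body(c): return list(c.body)
def Body_i(c, i): return c.body[i - 1]   # Body_i(c): the i-th body atom, 1-based
def Pred(a): return a.pred
def Preds(atoms): return [a.pred for a in atoms]
def Vars(a): return set(a.args)          # assumes all arguments are variables

# q(X) <- p(X, Y), r(Y).
c = Clause(Atom("q", ("X",)), (Atom("p", ("X", "Y")), Atom("r", ("Y",))))
print(Pred(Head(c)))       # q
print(Preds(Body(c)))      # ['p', 'r']
print(Pred(Body_i(c, 2)))  # r
```
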

We will mainly focus on the relational data model and relational queries [Cod70, Ull88, Ull89, Kan90] under a set-based rather than a bag-based semantics (that is, answers to queries are sets, as in the original relational model [Cod70], while they are bags in SQL).

2.1 Query Languages

Let dom be a countably infinite domain of atomic values. A relation schema R is a relation name together with a sort, which is a tuple¹ of attribute names, and an arity, i.e.,

sort(R) = ⟨A₁, . . . , Aₙ⟩    arity(R) = n

¹ Relation schemata are usually defined as sets of attributes. However, we choose the tuple, as we will use the unnamed calculus perspective widely throughout this work.

A (relational) schema R is a set of relation schemata. A relation I is a finite set of tuples, I ⊆ domⁿ. A database instance I is a set of relations.

A relational query Q is a function that maps each instance I over a schema R and dom to another instance J over a different schema R′.

Relational queries can be seen from at least two perspectives, an algebraic

and a calculus viewpoint. Relational algebra ALG is based on the following basic

algebraic operations (see [Cod70] or [Ull88, AHV95]):

• Set-based operations (intersection ∩, union ∪, and difference \) over relations of the same sort (that is, arity, as we assume a single domain dom of atomic values).

• Tuple-based operations (projection π, which eliminates or renames columns of relations, and selection σ, which filters tuples of a relation according to a predicate built by conjunction of equality atoms, which are statements of the form A = B, where A, B are relational attributes).

• The cartesian product × as a constructive operation that, given two relations R1 and R2 of arities n and m, respectively, produces a new relation of arity n + m which contains a tuple ⟨t₁, t₂⟩ for each distinct pair of tuples t₁, t₂ with t₁ ∈ R1 and t₂ ∈ R2.

Other operations (e.g., various kinds of joins) can be defined from these.

There are various subtleties, such as named and unnamed perspectives of ALG,

for which we refer to [AHV95].
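A minimal executable rendering of these basic operations, under the unnamed perspective (attributes are tuple positions), might look as follows; the function names are our own, and relations are plain Python sets of tuples, so ∩, ∪, and \ come for free.

```python
# Relations as sets of tuples; positions play the role of attributes.
def project(rel, positions):
    """π: keep (and reorder) the given column positions."""
    return {tuple(t[i] for i in positions) for t in rel}

def select(rel, pred):
    """σ: keep the tuples satisfying the predicate."""
    return {t for t in rel if pred(t)}

def cartesian(r1, r2):
    """×: concatenate every tuple of r1 with every tuple of r2."""
    return {t1 + t2 for t1 in r1 for t2 in r2}

r = {(1, "a"), (2, "b")}
s = {("x",)}
print(project(r, [0]))                 # {(1,), (2,)}
print(select(r, lambda t: t[0] == 1))  # {(1, 'a')}
print(cartesian(r, s))                 # {(1, 'a', 'x'), (2, 'b', 'x')}
```

Joins can then be defined from these, e.g. an equi-join as a selection over a cartesian product.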

Queries in the first-order relational domain calculus CALC are of the form

{⟨X̄⟩ | Φ(X̄)}

where X̄ is a tuple of variables (called “unbound” or “distinguished”) and Φ is a first-order formula (using ∀, ∃, ∧, ∨, and ¬) over relational predicates pᵢ.

An important desirable property of well-behaved database queries is domain independence. Let the set of all atomic values appearing in a database I be called the active domain (adom). A CALC query Q over a schema R is domain independent iff, for any possible database I over R, Q_dom(I) = Q_adom(I).

Example 2.1.1 The CALC query {⟨x, y⟩ | p(x)} is not domain independent, as the variable y is free to bind with any member of the domain. Clearly, such a query does not satisfy the intuitions of well-behaved database queries.
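The dependence on the evaluation domain can be made visible by naively evaluating the query of Example 2.1.1 over two different domains; this toy evaluation function is our own sketch.

```python
# Naive evaluation of {<x, y> | p(x)} over a given domain: y ranges
# over the whole domain, so the answer changes with the domain.
def eval_query(p, domain):
    return {(x, y) for x in domain for y in domain if (x,) in p}

p = {(1,)}
adom = {1}                  # active domain: the values occurring in the database
dom = {1, 2, 3}             # some larger underlying domain
print(eval_query(p, adom))  # {(1, 1)}
print(eval_query(p, dom))   # {(1, 1), (1, 2), (1, 3)} -- depends on dom
```

Since the two results differ, Q_dom(I) ≠ Q_adom(I), witnessing the failure of domain independence.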


Unfortunately, the domain independence property is undecidable for CALC. An alternative, purely syntactic property is safety or range restriction. We refer to [AHV95] for a treatment of the safe-range calculus CALCsr, which is necessarily somewhat lengthy. It can be shown that ALG, the domain independent relational calculus, and CALCsr are all (language) equivalent.

We refer to the class of ∀, ¬-free queries as the positive relational calculus

queries and the queries that only use ∃ and ∧ to build formulae as the conjunctive

queries. By default, conjunctive queries may contain constants but no built-in

arithmetic comparison operators.

Conjunctive queries can be written as function-free Horn clauses, called dat-

alog notation. A conjunctive query {hX̄i | ∃Ȳ : p1 (X̄1 ) ∧ . . . ∧ pn (X̄n )} is written

as a datalog rule

q(X̄) ← p1 (X̄1 ), . . . , pn (X̄n ).

**Furthermore, conjunctive queries have to be safe. Safety in the case of con-
**

junctive queries is quite simple to define. A conjunctive query is safe iff each

variable in the head also appears somewhere in the atoms built from database

predicates in the body, X̄ ⊆ X̄1 ∪ . . . ∪ X̄n . Throughout this thesis, we choose

among the set-theoretic notation for conjunctive queries shown above and the

datalog notation, whichever is most convenient to support the presentation.
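The safety test just defined can be sketched as follows (an illustrative Python fragment; for simplicity, a query is given as a head-argument tuple plus a list of (predicate, arguments) body atoms, and variables are, by datalog convention, the capitalized symbols):

```python
# Safety of a conjunctive query in datalog notation: every head variable
# must also occur in some body atom built from a database predicate.

def is_var(term):
    """Datalog convention: variables are capitalized identifiers."""
    return isinstance(term, str) and term[:1].isupper()

def is_safe(head_args, body_atoms):
    body_vars = {a for (_pred, args) in body_atoms for a in args if is_var(a)}
    head_vars = {a for a in head_args if is_var(a)}
    return head_vars <= body_vars

# q(X, Y) <- p(X, Z), r(Z, Y)  is safe;  q(X, W) <- p(X, Z)  is not.
assert is_safe(("X", "Y"), [("p", ("X", "Z")), ("r", ("Z", "Y"))])
assert not is_safe(("X", "W"), [("p", ("X", "Z"))])
```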

Conjunctive queries correspond to select-from-where clauses in SQL in which the constraints in the where clause only use equality (=) as comparison operator.

Example 2.1.2 The subsumed query from Example 1.3.1 (a conjunctive query) can be written as a select-from-where query in SQL

select pc_cpu.pc, pc_cpu.cpu, loc.lname
from pc_cpu, belongs_to, loc, pc_loc
where pc_cpu.pc = belongs_to.pc
and pc_cpu.pc = pc_loc.pc
and pc_loc.lid = loc.lid
and belongs_to.org_entity = 'Project1';

or equivalently

q(Pc, Cpu, LName) ← pc_cpu(Pc, Cpu), belongs_to(Pc, "Project1"), loc(LId, LName), pc_loc(Pc, LId)

in datalog rule notation or

πPc,Cpu,LName(pc_cpu ⋈ σOrg_Entity="Project1"(belongs_to) ⋈ pc_loc ⋈ loc)

as an ALG query.


Queries with inequality constraints (i.e., ≠, <, ≤, also called arithmetic comparison predicates or built-in predicates) are outside of ALG or CALC in principle, but extensions can be defined without much difficulty². A conjunctive query with inequalities is a clause of the form

q(X̄) ← p1(X̄1), . . . , pn(X̄n), xi1,1 θ1 xi1,2, . . . , xim,1 θm xim,2.

where the xij,k are variables in X̄1, . . . , X̄n and θj ∈ {≠, <, ≤}.

A datalog program is a set of datalog rules. The dependency graph of a datalog program P is the directed graph ⟨V, E⟩ where V is the set of predicate names in P and E contains an arc from predicate pi to predicate pj iff there is a datalog rule in P such that pi is its head predicate and pj appears in the body of that same rule. A datalog program is recursive iff its dependency graph is cyclic.
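The dependency graph and the recursion test can be sketched as follows (an illustrative Python fragment; a rule is abstracted to its head predicate and the list of predicate names in its body, since argument lists do not matter for the dependency graph):

```python
# Build the dependency graph of a datalog program and test it for cycles
# with a depth-first search.

def dependency_graph(rules):
    """rules: iterable of (head_predicate, body_predicate_list) pairs."""
    edges = {}
    for head, body in rules:
        edges.setdefault(head, set()).update(body)
    return edges

def is_recursive(rules):
    edges = dependency_graph(rules)
    def cyclic(node, path, done):
        if node in path:
            return True          # back edge: the graph is cyclic
        if node in done:
            return False
        path.add(node)
        if any(cyclic(n, path, done) for n in edges.get(node, ())):
            return True
        path.remove(node)
        done.add(node)
        return False
    return any(cyclic(p, set(), set()) for p in edges)

# t(x,y) <- e(x,y).  t(x,y) <- e(x,z), t(z,y).   (transitive closure)
assert is_recursive([("t", ["e"]), ("t", ["e", "t"])])
assert not is_recursive([("q", ["a"]), ("a", ["b"]), ("b", ["c"])])
```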

Positive queries (select-from-where-union queries in SQL) can be written as nonrecursive datalog programs. Since conjunctive queries are closed under composition, all positive queries can also be transformed into equivalent sets of conjunctive queries (with the head atoms over the same "query" predicate). The size of these sets can be exponentially larger than that of the corresponding nonrecursive datalog programs. The process of transforming a nonrecursive datalog program into a set of conjunctive queries is a form of translating a logical formula into Disjunctive Normal Form (DNF) and is called query unfolding.

Example 2.1.3 The nonrecursive datalog program

q(x, y, z, w) ← a(x, y, z, w).
a(x, y, z, 1) ← b(x, y, z).     a(x, y, z, 2) ← b(x, y, z).
b(x, y, 1) ← c(x, y).           b(x, y, 2) ← c(x, y).
c(x, 1) ← d(x).                 c(x, 2) ← d(x).

with 2 · 3 + 1 = 7 rules is equivalent to the following set

q(x, 1, 1, 1) ← d(x).     q(x, 1, 1, 2) ← d(x).
q(x, 1, 2, 1) ← d(x).     q(x, 1, 2, 2) ← d(x).
q(x, 2, 1, 1) ← d(x).     q(x, 2, 1, 2) ← d(x).
q(x, 2, 2, 1) ← d(x).     q(x, 2, 2, 2) ← d(x).

of 2³ = 8 conjunctive queries.
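A minimal sketch of query unfolding (illustrative Python; variables are represented as strings and constants as integers, and the names are not from the thesis):

```python
from itertools import count

def unify(call_args, head_args, sub):
    """Extend sub to a most general unifier of the two argument lists;
    variables are strings, constants are ints. Returns None on clash."""
    def walk(t):
        while isinstance(t, str) and t in sub:
            t = sub[t]
        return t
    for a, b in zip(call_args, head_args):
        a, b = walk(a), walk(b)
        if a == b:
            continue
        if isinstance(a, str):
            sub[a] = b
        elif isinstance(b, str):
            sub[b] = a
        else:
            return None                     # two distinct constants
    return sub

def unfold(query, rules):
    """Unfold a nonrecursive datalog program into conjunctive queries.
    query: (head_args, body_atoms); rules: dict mapping each intensional
    predicate to a list of (head_args, body_atoms); atoms are
    (predicate, argument-list) pairs."""
    fresh = count()

    def rename(args, mapping):              # rename rule variables apart
        return [mapping.setdefault(a, f"{a}~{next(fresh)}")
                if isinstance(a, str) else a for a in args]

    def subst(sub, args):
        def walk(t):
            while isinstance(t, str) and t in sub:
                t = sub[t]
            return t
        return [walk(a) for a in args]

    results, agenda = [], [query]
    while agenda:
        head, body = agenda.pop()
        idx = next((i for i, (p, _) in enumerate(body) if p in rules), None)
        if idx is None:                     # only extensional atoms left
            results.append((head, body))
            continue
        pred, call_args = body[idx]
        for h_args, r_body in rules[pred]:  # one branch per defining rule
            m = {}
            sub = unify(call_args, rename(h_args, m), {})
            if sub is None:
                continue
            new_body = [(p, subst(sub, a)) for p, a in body[:idx]] \
                     + [(p, subst(sub, rename(a, m))) for p, a in r_body] \
                     + [(p, subst(sub, a)) for p, a in body[idx + 1:]]
            agenda.append((subst(sub, head), new_body))
    return results

rules = {
    "a": [(["x", "y", "z", 1], [("b", ["x", "y", "z"])]),
          (["x", "y", "z", 2], [("b", ["x", "y", "z"])])],
    "b": [(["x", "y", 1], [("c", ["x", "y"])]),
          (["x", "y", 2], [("c", ["x", "y"])])],
    "c": [(["x", 1], [("d", ["x"])]),
          (["x", 2], [("d", ["x"])])],
}
cqs = unfold((["x", "y", "z", "w"], [("a", ["x", "y", "z", "w"])]), rules)
assert len(cqs) == 8                        # the 2^3 queries of the example
assert all(len(b) == 1 and b[0][0] == "d" for _h, b in cqs)
```

The agenda-driven loop replaces one intensional atom at a time by each rule body defining its predicate, which is exactly where the exponential blowup of the example arises.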

Relational algebra and calculus are far from representing all computable queries over relational databases. For example, not even the transitive closure of binary relations can be expressed using the first-order queries³. Much has been said on categories and hierarchies of relational query languages, and examples of languages strictly more expressive than relational algebra and calculus are, for instance, datalog with negation (under various semantics) or the while queries. We refer to [CH82, Cha88, Kan90, AHV95] for more on these issues.

²There are, however, a few subtle issues, such as the question whether the domain is totally ordered, with its impact on data independence [CH80, CH82], that are important for the theory of queries. Since we will only touch on queries with inequalities briefly, we leave this aside.

Treatments of the complexity and expressiveness of relational query languages can be found in [Var82, CH82, Cha88, AHV95]. We leave these issues to the related literature and remark only that the positive relational calculus queries are complete in PSPACE with respect to expression complexity [Var82]. The decision problem whether an unfolding of a conjunctive query with a nonrecursive datalog program (with constants) exists that uses only certain relational predicates (a problem related to the approach to data integration developed later in this thesis) is equally PSPACE-complete and thus presumably computationally hard.

2.2 Query Containment

The problem of deciding whether a query Q1 is contained in a query Q2 (denoted Q1 ⊆ Q2), possibly under a number of constraints describing a schema, is that of deciding whether, for every possible database satisfying the constraints, each tuple in the result of Q1 is also contained in the result of Q2. Two queries are called equivalent, denoted Q1 ≡ Q2, iff Q1 ⊆ Q2 and Q1 ⊇ Q2.

The containment problem quickly becomes undecidable for expressive query languages. Already for relational algebra and calculus, the problem is undecidable [SY80, Kan90]. In fact, the problem is co-r.e. but not recursive (under the assumption that databases are finite but the domain is not): noncontainment can be established by searching all finite databases over dom for a counterexample, but there is no procedure that terminates on every containment.

For conjunctive queries, the containment problem is decidable and NP-complete [CM77]. Since queries tend to be small, query containment can be practically used, for instance in query optimization or data integration [CKPS95, YL87]. It is usually formalized using the notion of containment mappings (homomorphisms) [CM77].

Definition 2.2.1 Let Q1 and Q2 be two conjunctive queries. A containment mapping θ is a function from the variables and constants of Q1 into the variables and constants of Q2 that is

• the identity on the constants of Q1,

• such that θ(Headi(Q1)) = Headi(Q2) for every head position i,

• and for which, for every atom p(x1, . . . , xn) ∈ Body(Q1), p(θ(x1), . . . , θ(xn)) ∈ Body(Q2).

³However, transitive closure can of course be expressed in datalog.

It can be shown that for two conjunctive queries Q1 and Q2, the containment Q1 ⊆ Q2 holds iff there is a containment mapping from Q2 into Q1 [CM77].

Example 2.2.2 [AHV95] The two conjunctive queries

q1(x, y, z) ← p(x2, y1, z), p(x, y1, z1), p(x1, y, z1), p(x, y2, z2), p(x2, y2, z).

and

q2(x, y, z) ← p(x2, y1, z), p(x, y1, z1), p(x1, y, z1).

are equivalent. For q1 ⊆ q2, the containment mapping is the identity. Clearly, since Body(q2) ⊂ Body(q1) and the heads of the two queries match, q1 ⊆ q2 must hold. For the other direction, we have θ(x) = x, θ(y) = y, θ(z) = z, θ(x1) = x1, θ(y1) = y1, θ(z1) = z1, θ(x2) = x2, θ(y2) = y1, and θ(z2) = z1.

An alternative way [Ull97] of deciding whether a conjunctive query Q1 is contained in a second, Q2, is to freeze the variables of Q1 into new constants (i.e., constants which do not appear in the two queries) and to evaluate Q2 on the canonical database created from the frozen body atoms of Q1. Q1 is then contained in Q2 if and only if the frozen head of Q1 appears in the result of Q2 over the canonical database.

Example 2.2.3 Consider again the two queries of Example 2.2.2. The canonical database for q2 is I = {p(ax2, ay1, az), p(ax, ay1, az1), p(ax1, ay, az1)} where ax, ay, az, ax1, ay1, az1, ax2 are constants. We have

q1(I) = {⟨ax2, ay1, az⟩, ⟨ax2, ay1, az1⟩, ⟨ax, ay1, az⟩, ⟨ax, ay1, az1⟩, ⟨ax, ay, az⟩, ⟨ax, ay, az1⟩, ⟨ax1, ay1, az1⟩, ⟨ax1, ay, az1⟩}

Since the frozen head of q2 is ⟨ax, ay, az⟩ and ⟨ax, ay, az⟩ ∈ q1(I), q2 is contained in q1.
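The canonical-database test can be sketched as follows (illustrative Python; for simplicity, queries are assumed constant-free, as in the examples above, and the representation is a head-variable list plus (predicate, arguments) body atoms):

```python
def matchings(body, facts, sub):
    """Generate all extensions of substitution sub that map every atom of
    body onto some fact in facts (backtracking over the atoms in order)."""
    if not body:
        yield sub
        return
    (pred, args), rest = body[0], body[1:]
    for fpred, fargs in facts:
        if fpred != pred or len(fargs) != len(args):
            continue
        s = dict(sub)
        if all(s.setdefault(a, c) == c for a, c in zip(args, fargs)):
            yield from matchings(rest, facts, s)

def contained_in(q1, q2):
    """Test q1 ⊆ q2 by freezing q1's variables into constants, evaluating
    q2 over the resulting canonical database, and looking for q1's frozen
    head in the answers [Ull97]."""
    head1, body1 = q1
    head2, body2 = q2
    frz = lambda v: ("frozen", v)
    facts = [(p, tuple(frz(a) for a in args)) for p, args in body1]
    target = tuple(frz(a) for a in head1)
    return any(tuple(s.get(a) for a in head2) == target
               for s in matchings(body2, facts, {}))

# Example 2.2.2: q1 and q2 are equivalent.
q1 = (["x", "y", "z"], [("p", ["x2", "y1", "z"]), ("p", ["x", "y1", "z1"]),
                        ("p", ["x1", "y", "z1"]), ("p", ["x", "y2", "z2"]),
                        ("p", ["x2", "y2", "z"])])
q2 = (["x", "y", "z"], [("p", ["x2", "y1", "z"]), ("p", ["x", "y1", "z1"]),
                        ("p", ["x1", "y", "z1"])])
assert contained_in(q1, q2) and contained_in(q2, q1)
```

The backtracking evaluation is, of course, just a search for a containment mapping, reflecting the equivalence of the two formulations.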

The containment of positive queries Q1, Q2 can be checked by transforming them into sets of conjunctive queries Q′1, Q′2. Q′1 is then contained in Q′2 iff each member query of Q′1 is individually contained in some member query of Q′2.


Bibliographic Notes

The containment problem for conjunctive queries is NP-complete, as mentioned. The problem can be solved efficiently for two queries if neither query contains more than two atoms of the same relational predicate [Sar91]; in that case, a very efficient algorithm exists that runs in time linear in the size of the queries. Another polynomial-complexity case is encountered when the so-called hypergraph of the query to be tested for subsumption is acyclic [YO79, FMU82, AHV95]. For that class of queries, the technique of Example 2.2.3 can be combined with the polynomial expression complexity of the candidate subsumer query.

If arithmetic comparison predicates⁴ are permitted in conjunctive queries [Klu88], checking query containment becomes harder and jumps to the second level of the polynomial hierarchy [vdM92]. The containment of datalog queries is undecidable [Shm87]. This remains true even for some very restricted classes of single-rule programs (sirups) [Kan90]. Containment of a conjunctive query in a datalog query is EXPTIME-complete; this problem can be solved with the method of Example 2.2.3, but then consumes the full expression complexity of datalog [Var82] (i.e., EXPTIME). The opposite direction, i.e., containment of a datalog program in a conjunctive query, is still decidable but highly intractable (it is 2-EXPTIME-complete [CV92, CV94, CV97]).

Other interesting recent work has been on the containment of so-called regular path queries, which have found much research interest in the field of semistructured databases, under constraints [CDL98a], and on the containment of a class of queries over databases with complex objects [LS97] (see also Section 2.5).

2.3 Dependencies

Dependencies are used in database design to add semantics and integrity constraints to a schema, with which database instances have to comply. Two particularly important classes of dependencies are functional dependencies (abbreviated fd's) and inclusion dependencies (ind's).

A functional dependency R : X → Y over a relational predicate R (where X and Y are sets of attribute names of R⁵) has the following semantics. It enforces that, for each relation instance over R and each pair t1, t2 of tuples in the instance, if for each attribute name in X the values in t1 and t2 are pairwise equal, then the values for the attributes in Y must be equal as well.

Primary keys are special cases of functional dependencies where X ∪ Y contains all attributes of R.

Example 2.3.1 Let R ⊆ dom³ be a ternary relation with two functional dependencies R : $1 → $2 $3 (i.e., the first attribute is a primary key for R) and R : $3 → $2. Consider an instance I = {⟨1, 2, 3⟩}. The attempt to insert a new tuple ⟨1, 2, 4⟩ into R would violate the first fd, while the attempt to do the same for ⟨5, 6, 3⟩ would violate the second.

⁴Such queries satisfy the real-world need of asking queries where an attribute is to be, for instance, of value greater than a certain constant.

⁵Under the unnamed perspective, which is sufficient for conjunctive queries in datalog notation, we will refer to the i-th attribute position in R by $i instead of an attribute name.
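A check of this fd semantics can be sketched as follows (illustrative Python; the attribute positions $1, $2, $3 become 0-based indices):

```python
# An instance satisfies the fd lhs -> rhs iff no two tuples agree on the
# lhs positions while disagreeing on the rhs positions.

def satisfies_fd(instance, lhs, rhs):
    seen = {}
    for t in instance:
        key = tuple(t[i] for i in lhs)
        val = tuple(t[i] for i in rhs)
        if seen.setdefault(key, val) != val:
            return False
    return True

I = {(1, 2, 3)}
# R: $1 -> $2 $3  and  R: $3 -> $2, as in Example 2.3.1
assert satisfies_fd(I | {(1, 2, 4)}, [0], [1, 2]) is False   # first fd violated
assert satisfies_fd(I | {(5, 6, 3)}, [2], [1]) is False      # second fd violated
assert satisfies_fd(I | {(5, 6, 4)}, [0], [1, 2])            # insertion allowed
```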

Informally, inclusion dependencies are containment relationships between queries of the form πγ(R), i.e., attributes of a single relation R may be reordered or projected out. Foreign key constraints, which require that a foreign key stored in one tuple must also exist in the key attribute position of some tuple of a (usually different) relation, are inclusion dependencies.

Notably, dependencies, as database semantics, are valuable in query optimization and allow the integrity of database updates to be enforced.

2.4 Global Query Optimization

Modern database systems rely on the idea of a separation of physical and logical schemata in order to simplify their use [TK78, AHV95]. This, together with the declarative flavor of many query languages, leads to the need to optimize queries such that they execute quickly.

In the general case of the relational queries (i.e., ALG or the relational calculus), global optimization is not computable. For conjunctive queries, though, and on the logical level, where physical cost-based metrics can be left out of consideration, global optimality (that is, minimality) can be achieved. A conjunctive query Q is minimal if there is no equivalent conjunctive query Q′ such that Q′ has fewer atoms (subgoals) in its body than Q.

This notion of optimality is justified because joins of relations are usually among the most expensive relational (algebra) operations carried out by a relational database system during query execution. Minimality is of interest in data integration as well.

Computing a minimal equivalent conjunctive query is strongly related to the query containment problem (see Section 2.2). The associated decision problem is again NP-complete. Minimal queries can be computed using the following fact [CM77]. Given a conjunctive query Q, there is a minimal query Q′ (with Q ≡ Q′) such that Head(Q) = Head(Q′) and Body(Q′) ⊆ Body(Q), i.e., the heads are equal and the body of Q′ contains a subset of the subgoals of Q, without any changes to variables or constants. Conjunctive queries can thus be optimized by checking all queries created by dropping body atoms from Q while preserving equivalence, and searching for the smallest such query.
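This optimization procedure can be sketched as follows (illustrative Python; for simplicity, queries are constant-free and the homomorphism search fixes the head variables positionwise, which suffices here because the two queries compared always share the same head):

```python
# Minimize a conjunctive query by dropping body atoms while preserving
# equivalence [CM77]. Dropping an atom preserves equivalence iff there is
# a containment mapping from the original query into the reduced one (the
# converse containment is immediate, as the reduced body is a subset).

def has_hom(q_from, q_to):
    head_f, body_f = q_from
    head_t, body_t = q_to
    theta = dict(zip(head_f, head_t))      # head variables map positionwise
    def search(atoms, theta):
        if not atoms:
            return True
        (p, args), rest = atoms[0], atoms[1:]
        for p2, args2 in body_t:
            if p2 != p or len(args2) != len(args):
                continue
            t2 = dict(theta)
            if all(t2.setdefault(a, b) == b for a, b in zip(args, args2)):
                if search(rest, t2):
                    return True
        return False
    return len(head_f) == len(head_t) and search(body_f, theta)

def minimize(query):
    head, body = query
    body = list(body)
    i = 0
    while i < len(body):
        reduced = body[:i] + body[i + 1:]
        if reduced and has_hom((head, body), (head, reduced)):
            body = reduced                 # atom i was redundant
        else:
            i += 1
    return head, body

# q1 of Example 2.2.2 minimizes to (the body of) q2.
q1 = (["x", "y", "z"], [("p", ("x2", "y1", "z")), ("p", ("x", "y1", "z1")),
                        ("p", ("x1", "y", "z1")), ("p", ("x", "y2", "z2")),
                        ("p", ("x2", "y2", "z"))])
head, body = minimize(q1)
assert len(body) == 3
```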

Example 2.4.1 Take the queries q1 and q2 from Example 2.2.2. By checking all subsets of Body(q2), it can be seen that q2 is already minimal. In fact, q2 is also a minimal query for q1, as Body(q2) is the smallest subset of Body(q1) such that q2 and q1 remain equivalent.

Global optimization of conjunctive queries under a number of dependencies (e.g., fd's) can be carried out using a folklore technique called the chase [ABU79, MMS79], for which we refer to the literature (see also [AHV95]).

2.5 Complex Values and Object Identities

Among the principal additional features of the object-oriented data model [BM93, Kim95, CBB+97], compared to the relational model, we have object identifiers, objects that have complex ("nested") values, IS-A hierarchies, and behavior attributed to classes of objects, usually via (mostly) imperative methods. For the purpose of querying and data integration under the object-oriented data model, the notions of object identifiers and complex objects deserve some consideration.

Research on complex values in database theory started by giving up the requirement that relations may only contain atomic values of the domain (non-first-normal-form databases). The complex value model, theoretically very elegant, is a strict generalization of the relational data model. Values are created inductively from set and tuple constructors. The relational data model is thus the special case of the complex value model where each relation is a set of tuples over the domain. For instance,

{⟨A : dom, B : dom, C : {⟨A : dom, B : {dom}⟩}⟩}

is a valid sort in the complex value model and

{⟨a, b, {⟨c, {}⟩, ⟨d, {e, g}⟩}⟩, ⟨e, f, {}⟩}

is a value of this sort, where a, b, c, d, e, f, g are constants of dom. As for the relational data model, algebra- and calculus-based query languages can be specified, and equivalences between them established. Informally, in the algebraic perspective, the set-based operations (union, intersection, and difference), which are required to operate over sets of the same sort, and the simple tuple-based operations (such as projection) known from the relational model are extended in three ways: by a more expressive selection operation, which may have conditions such as set membership and equality of complex values; by the powerset operation; and by tuple- and set-creation and -destruction operations (see [AHV95]). Other operations such as renaming, join, and nesting and unnesting can be defined from these. The complex-value algebra (ALGcv) has hyperexponential complexity. When the powerset operation is replaced by nesting and unnesting operations, we arrive at the so-called nested relation algebra ALGcv−. All queries in ALGcv− can be executed efficiently (relative to the size of the data), which has motivated commercial object-oriented database systems such as O2 [LRV88] and standards such as ODMG's OQL [CBB+97] to adopt it closely.

Interestingly, it can be shown that all ALGcv− queries over relational databases have equivalent relational queries [AB88, AHV95]. This is due to the fact that unnested values in a tuple always represent keys for the nested tuples; nestings are thus purely cosmetic.

Furthermore, every complex value database can be transformed (in polynomial time relative to the size of the complex value database) into a relational one [AHV95] (this, however, requires keys that identify nested tuples as objects, i.e., object identifiers). The nested relation model, and with it a large class of object-oriented queries, is thus just "syntactic sugaring" over the relational data model, with keys as surrogates for object identifiers. From the query-only standpoint of data integration, where structural integration can take care of inventing object identifiers in the canonical transformation between data models, we can thus develop techniques in terms of relational queries, which can then be straightforwardly applied to object-oriented databases as well⁶.

We also comment on the calculus perspective. Unlike in the relational model, in the complex value calculus CALCcv variables may represent, and be quantified over, complex values. We are thus operating in a higher-order predicate calculus with a finite model semantics. The generalization of range restriction (the safe-range calculus) from the relational calculus to the complex value calculus is straightforward but verbose (see [AHV95]). It can be shown that ALGcv and the safe-range calculus CALCcv (which represents exactly the domain independent complex value calculus queries) are equivalent. Furthermore, if set inclusion is disallowed but set membership, as the analog of nesting, remains permitted, the so-called strongly safe-range calculus CALCcv− is attained, which is equivalent to ALGcv−.

The conjunctive nested relation algebra, in which set union and difference have been removed from ALGcv−, is thus equivalent to the conjunctive relational queries.

Example 2.5.1 Consider an instance Parts, which is a set of complex values of the following sort. A part (in a product-data management system) is a tuple of a barcode B, a name N, and a set of characteristics C. A characteristic is a tuple of a name N and a set of data elements D. A data element is a tuple of a name N, a unit of measurement U, and a value V⁷. The sort can thus be written as

⟨B : dom, N : dom, C : {⟨N : dom, D : {⟨N : dom, U : dom, V : dom⟩}⟩}⟩

Suppose now that we ask the following query in nested relation algebra ALGcv−:

πN,B,D(unnestC(πB,C(Parts)))

which asks for transformed complex values of sort

⟨N : dom, B : dom, D : {⟨N : dom, U : dom, V : dom⟩}⟩

and can be formulated in strongly safe-range calculus CALCcv− as

{x : ⟨N, B, D : {⟨N, U, V⟩}⟩ | ∃y, z, z′, w, w′, u, u′ :
  y : ⟨B, N, C : {⟨N, D : {⟨N, U, V⟩}⟩}⟩ ∧
  z : {⟨N, D : {⟨N, U, V⟩}⟩} ∧ z′ : ⟨N, D : {⟨N, U, V⟩}⟩ ∧
  w : {⟨N, U, V⟩} ∧ w′ : ⟨N, U, V⟩ ∧
  u : {⟨N, U, V⟩} ∧ u′ : ⟨N, U, V⟩ ∧
  x.B = y.B ∧ y.C = z ∧ z′ ∈ z ∧
  z′.N = x.N ∧ z′.D = w ∧ w′ ∈ w ∧
  x.D = u ∧ u′ ∈ u ∧ u′ = w′}

Let us map the collection Parts to a flat relational database with schema

R = {Part(Poid, B, N), Char(Coid, N, Poid), DataElement(N, U, V, Coid)}

where the attributes Poid and Coid stand for object identifiers, which must be invented when flattening the data. The above query can now be equivalently asked in relational algebra as

πN,B,Dn,U,V((πPoid,B(Part) ⋈ Char) ⋈ πN→Dn,U,V,Coid(DataElement))

The greatest challenge here is the elimination or renaming of the three name attributes N. The same query has the following equivalent in the (conjunctive) relational calculus:

{⟨x, y, z, u, v⟩ | ∃i1, i2, d : Part(i1, x, d) ∧ Char(i2, y, i1) ∧ DataElement(z, u, v, i2)}

After executing the query, the results can be nested to obtain the correct result for the nested relational algebra or calculus query.

⁶Some support for object-oriented databases is a requirement in the use case of Section 1.3.

⁷For simplicity, we assume that all atomic values are of the same domain dom. This is not an actual restriction unless arithmetic comparison operators (<, ≤) are allowed in the query language.
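The flattening and re-querying described in this example can be sketched as follows (illustrative Python; the sample part data and the way fresh object identifiers are drawn from a counter are made up for illustration):

```python
import itertools

def flatten(parts):
    """Invent object identifiers while flattening the nested Parts
    collection into the relations Part, Char, and DataElement."""
    oid = itertools.count(1)
    part_rel, char_rel, data_rel = [], [], []
    for barcode, pname, chars in parts:
        poid = next(oid)
        part_rel.append((poid, barcode, pname))
        for cname, delems in chars:
            coid = next(oid)
            char_rel.append((coid, cname, poid))
            for dname, unit, value in delems:
                data_rel.append((dname, unit, value, coid))
    return part_rel, char_rel, data_rel

parts = [("4711", "bolt", [("geometry", [("length", "mm", 25)])])]
part_rel, char_rel, data_rel = flatten(parts)

# the conjunctive relational query of the example, as a nested-loop join
result = [(b, cn, dn, u, v)
          for (poid, b, _pn) in part_rel
          for (coid, cn, p2) in char_rel if p2 == poid
          for (dn, u, v, c2) in data_rel if c2 == coid]
assert result == [("4711", "geometry", "length", "mm", 25)]
```

Nesting the data-element columns of the result back under each (barcode, characteristic) pair then yields the answer to the original nested query.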


Chapter 3

Data Integration

This chapter briefly surveys several research areas related to data integration. We proceed by first presenting two established architectures: federated and multidatabases in Section 3.2 and data warehouses in Section 3.3. Next, in Section 3.4, we discuss information integration in AI. Several research areas of AI that are relevant to this thesis are surveyed, including ontology-based global information systems, capability description and planning, and multi-agent systems as a further integration architecture. Then we discuss global-as-view integration (together with an integration architecture, mediator systems) in Section 3.5 and local-as-view integration in Section 3.6. In Sections 3.7 and 3.8 we arrive at recent data integration approaches. Section 3.9 discusses management and maintainability issues in large and evolving data integration systems and compares the different approaches presented according to various qualitative aspects. First, however, we start with some definitions.

3.1 Definitions and Overview

Source integration [JLVV00] refers to the process of integrating a number of sources (e.g., databases) into one greater common entity. The term is usually used as part of a greater, more encompassing process, as perceived in the data warehousing setting, where source integration is usually followed by aggregation and online analytical processing (OLAP). There are two forms of source integration: schema integration and data integration. Schema integration [BLN86] refers to a software engineering or knowledge engineering approach, the process of reverse-engineering information systems and reengineering schemata in order to obtain a single common "integrated" schema; we will not address it in more detail in this thesis. While the terms data and information are of course not to be confused, data integration and information integration are normally used synonymously (e.g., [Wie96, Wie92]).

Figure 3.1 ("Artist's impression of source integration") depicts this classification: source integration comprises schema integration and data integration, and data integration in turn comprises structural integration, semantic integration, and data reconciliation.

Data integration is the area of research that addresses problems related to the provision of interoperability to information systems by the resolution of heterogeneity between systems on the level of data. This distinguishes the problem from the wider aim of cooperative information systems [Coo], where also more advanced concepts such as workflows, business processes, and supply chains come into play, and where problems related to coordination and collaboration of subsystems are studied which go beyond the techniques required and justified for the integration of data alone.

The data integration problem can be decomposed into several subproblems. Structural integration (e.g., wrapping [GK94, RS97]) is concerned with the resolution of structural heterogeneity, i.e., the heterogeneity of data models, query and data access languages, and protocols¹. This problem is particularly interesting when it comes to legacy systems, which are systems that in general have some aspect that would be changed in an ideal world but in practice cannot be [AS99]. In practice, this often refers to out-of-date systems in which parts of the code base or subsystems cannot be adapted to new requirements and technologies because they are no longer understood by the current maintainers or because the source code has been lost.

Semantic integration refers to the resolution of semantic mismatch between schemata. Mismatch of concepts appearing in such schemata may be due to a number of reasons (see e.g. [GMPQ+97]) and may be a consequence of differences in conceptualizations in the minds of different knowledge engineers. Mismatch may not only occur on the level of schema entities (relations in a relational database or classes in an object-oriented system), but also on the level of data. The associated problem, called data reconciliation [JLVV00], includes object identification (i.e., the problem of determining correspondences between objects represented by different heterogeneous data sources) and the handling of mistakes that happened during the acquisition of data (e.g., typos), which is usually referred to as data cleaning. An overview of this classification of source integration is given in Figure 3.1.

¹We experience structural heterogeneity if we need to make a number of databases interoperable of which, for example, some are relational and others object-oriented, or if among the relational databases some are only queryable using SQL while others are only queryable using QUEL [SHWK76]. Other kinds of structural heterogeneity are encountered when two database systems use different models for managing transactions, or when they lack middleware, compatible with both, which allows queries and results to be communicated.

Since, for this thesis, the main problem among those discussed in this section is the resolution of semantic mismatch, we will put a corresponding emphasis on this problem in the following discussion and comparison of research related to data integration.

3.2 Federated and Multidatabases

The data integration problem was addressed early on by work on multidatabase systems. Multidatabase systems are collections of several (distributed) databases that may be heterogeneous and need to share and exchange data. According to the classification² of [SL90], federated database systems [HM85] are a subclass of multidatabase systems. Federated databases are collections of collaborating but autonomous component database systems. Nonfederated multidatabase systems, on the other hand, may have several heterogeneous schemata but lack any other kind of autonomy. Nonfederated multidatabase systems have one level of management only, and all data management operations are performed uniformly for all component databases. Federated database systems can be categorized as loosely or tightly coupled systems. Tightly coupled systems are administered as one common entity, while in loosely coupled systems this is not the case and component databases are administered independently [SL90].

Component databases of a federated system may be autonomous in several senses. Design autonomy permits the creators of component databases to make their own design choices with respect to representation (i.e., data models and query languages), the data managed and the schemata used for managing them, and the conceptualizations and semantic interpretations of the data applied. Other kinds of component autonomy that are of less interest to this thesis but still deserve to be mentioned are communication autonomy, execution autonomy, and association autonomy [SL90, HM85]. Autonomy is often in conflict with the need for sharing data within a federated database system. Thus, one or several kinds of autonomy may have to be relaxed in practice to be able to provide interoperability.

²There is some heterogeneity in the nomenclature of this area, and a cautionary note is due at this point: many of the terms in this chapter have been used heterogeneously by the research community. Certain choices, which are hopefully well documented, had to be made in this thesis to allow a uniform presentation.


Figure 3.2: Federated 5-layer schema architecture (external schemata on top of federated schemata, which are mapped against export schemata derived from component schemata, which in turn are translations of the local schemata).

Modern database systems successfully use a three-tier architecture [TK78] which separates physical (also called internal) from logical representation, and the logical schema in turn from possibly multiple user or application perspectives (provided by views). In federated database systems, these three layers are considered insufficient, and a five-layer schema architecture has been proposed (e.g., [SL90] and Figure 3.2). Under this architecture, there are five types of schemata between which queries are translated:

• Local schemata. The local schema of a component database corresponds to the logical schema in the classical three-layered architecture of centralized database systems.

• Component schemata. The component schema of a database is a version of its local schema translated into the data model and representation formalism shared across the federated database system.

• Export schemata. An export schema contains only the part of a component schema relevant to one integrated federated schema.

• Federated schemata³. A federated schema is an integrated homogeneous view of the federation, against which a number of export schemata are mapped (using data integration technology). There may be several such federated schemata inside a federation, providing different integrated views of the available data.

³These are also known as import schemata or global schemata [SL90].


Figure 3.3: Data warehousing architecture and process (wrappers feed a mediator performing data reconciliation and integration; the data warehouse is populated from it, and extraction and aggregation produce data marts and "data cubes" (MDDBS) for data analysis).

• External schemata. External schemata provide application- or user-specific views of the federated schemata, as in the classical three-layer architecture.

This five-layer architecture is believed to provide better support for the integration and management of heterogeneous autonomous databases than the classical three-layer architecture [HM85, SL90].

3.3 Data Warehousing

Data Warehousing (Figure 3.3) is a somewhat interdisciplinary area of research whose scope goes beyond pure data integration. The goal is usually, in an enterprise environment, to collect data from a number of distributed sites⁴ (e.g., grocery stores), clean and integrate them, and put them into one large central store, the corporate data warehouse. Data warehousing is also about performing aggregation of relevant data (e.g. sales data). Data may then be extracted and transformed according to schemata customized for particular users or analysis tools (Online Analytical Processing, OLAP) [JLVV00].
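The kind of summarization such aggregation performs can be sketched as a plain group-by rollup over a fact table. The following is an illustration only; the sales facts, the dimension layout, and the `rollup` helper are invented for this example and are not part of any system discussed here:

```python
from collections import defaultdict

# Toy sales facts: (product_category, region, year, amount).
sales = [
    ("books",   "north", 2001, 120.0),
    ("books",   "south", 2001,  80.0),
    ("books",   "north", 2002,  50.0),
    ("grocery", "north", 2001, 200.0),
]

def rollup(facts, dims):
    """Summarize the amount measure along the given dimension indices,
    as an OLAP aggregation over one slice of a data cube would."""
    totals = defaultdict(float)
    for *dim_values, amount in facts:
        key = tuple(dim_values[d] for d in dims)
        totals[key] += amount
    return dict(totals)

# Total sales per category (dimension 0), ignoring region and year.
print(rollup(sales, dims=[0]))
# {('books',): 250.0, ('grocery',): 200.0}
```

A data cube precomputes such totals for many dimension combinations (category, region, time span, and their groupings) so that analysis queries need not rescan the raw facts.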

Since the data manipulated are in practice often highly mission-critical to enterprises and may be very large, special technologies have been developed for dealing with the aggregation of data (e.g. the summarization of sales data according to criteria such as categories of products sold, regions, and time spans), such as multidimensional databases (MDDBMS) or data cubes.

⁴ The point of this is not just the resolution of heterogeneity but also to have distinct systems for Online Transaction Processing (OLTP) and data analysis for decision support, which usually access data in very different ways and also need differently optimized schemata. (In OLTP, transactions are usually short and occur at a high density, while in OLAP, transactions are few but long and put emphasis on querying.)

As data integrated against a warehouse are usually materialized there, the data warehousing literature often makes a distinction between mediation, which is confined to data integration on demand, i.e. when a query against the warehouse occurs (also called "virtual" integration or the lazy approach [Wid96] by data warehouse researchers), and materialized data integration (the eager approach [Wid96]). The materialized approach to data integration in fact adds problems related to dynamic aspects (e.g., the view update and view maintenance problems). These problems are not yet well understood, and known theoretical results are often quite negative [AHV95].

Data warehousing has received considerable interest in industry, and there are several commercial implementations, such as those by Informix and MicroStrategy [JLVV00]. Two well-known research systems are WHIPS [GMLY98] and SQUIRREL [ZHKF95b, ZHKF95a].

3.4 Information Integration in AI

There has traditionally been much cross-fertilization between the artificial intelligence and information systems areas, and the intelligent integration of information [Wie96] is no exception. It is particularly worthwhile to take note of research on ontologies, capability description, planning, knowledge-based systems, and multi-agent systems. Another important area is description logics, which we leave to a section of their own (Section 3.7). Work in these areas has – sometimes indirectly – had much influence on data integration.

3.4.1 Integration against Ontologies

There is an ongoing discussion among formal ontologists and AI researchers on how to define ontologies [GN87, Gru, GG95, Gua94, HS97]. One definition that has been particularly well argued for refers to ontologies as partial accounts of specifications of conceptualizations [GG95]. Ontologies are logical theories of parts of conceptualizations (to be found in the mind of some knowledge engineer) of a problem domain. As such, ontologies may consist of more than taxonomical knowledge and may include virtually any kind of knowledge. In the context of information integration, we are interested in ontologies in practice as the information models of AI information systems, that is, as powerful forms of schemata.

Ontological engineering [Gua97, Gru92, Gru93a, Gru93b, CTP00] concerns itself with the design and maintenance of large ontologies. Several research projects on tools [DSW+99] for ontological engineering, such as the Ontolingua server [FFR96], have been carried out. One problem also addressed is that of reengineering and merging existing ontologies, which is in many ways similar to schema integration [BLN86]. Experiences show much similarity with developments in object-oriented software engineering and information systems research. Designing and maintaining large ontologies has been found to lead to problems (see the Cyc experience [LGP+90]), and research has followed approaches such as applying the idea of design patterns [GHJV94] to ontological engineering [CTP00], or the use of libraries of micro-ontologies, small building blocks that can be composed to create domain ontologies on demand.

AI data integration systems are usually based on an architecture in which there is one well-designed "global" domain ontology (as a theory of the world represented) against which a number of wrapped data sources are integrated. Such systems fall into the category of global information systems. For instance, the influential Carnot system [SCH+97] of MCC mapped databases against the large and well-known Cyc ontology [LGP+90, Cyc] using a deductive database language called LDL [Zan96]. For other similar interesting work see e.g. the OBSERVER project [MKSI96, MIKS00], SIMS [AK92, HK93, AAA+97], and InfoSleuth [NU97, NBN99, BBB+97, FNPB99, NPU98].

It has been claimed (e.g. in [MIKS00]) that global information systems based on ontologies are a substantial step forward compared to systems that integrate against database schemata, because ontologies make it possible to describe the information content of data repositories independently of the underlying syntactic representation of the data. The rationale behind this is that ontologies are defined as artifacts on the knowledge level [New82, New93] rather than the symbol level and should be independent of syntactic considerations. This claim of a practical advantage can be comfortably challenged, however. Apart from the necessary choice of some vocabulary for naming the concepts, ontological commitments have to be made on how to interrelate concepts (onto)logically (e.g. by part-of, is-a, and instance-of relationships), just as they have to be made in database schema design. Research in formal ontology such as [Bra83, GW00b, GW00a] aims at determining guidelines for such ontological commitments. It is highly questionable whether such work could ever keep humans from intuitively disagreeing on these issues; until such a consensus is reached, it would be misleading to make the above claim in the pragmatic context of data integration. Note also that the OBSERVER system of [MIKS00] uses the CLASSIC description logics system for representing ontologies, a system that is even considered by its designers to provide a symbol-level data model [BBMR89a, Bor95] (see also Section 3.7).

3.4.2 Capability Descriptions and Planning

Planning, as a particularly important application of problem solving, has been among the core topics of interest in artificial intelligence ever since the influential STRIPS planning system [FN71] established it as a research area in its own right, with its own special theoretical results and algorithms⁵ [Wel99].

Planning problems in STRIPS-like planners are described by an initial state of the world, a goal state, and a number of planning operators ("actions"), described by pre- and postconditions and invariants⁶. A solution to a planning problem is then a (possibly only partially ordered) sequence of operator applications that transforms the world from the given initial state to the desired goal state.
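This search problem can be sketched in a few lines. The following is a minimal illustration, not STRIPS itself; the two block-stacking operators, described here with add- and delete-lists as in the original STRIPS, are invented for the example:

```python
from collections import deque

# Each operator: (preconditions, add-list, delete-list) over ground facts.
OPS = {
    "pick_up_A": ({"on_table_A", "hand_empty"}, {"holding_A"},
                  {"on_table_A", "hand_empty"}),
    "stack_A_on_B": ({"holding_A"}, {"on_A_B", "hand_empty"}, {"holding_A"}),
}

def plan(initial, goal, ops):
    """Breadth-first forward search: apply any operator whose
    preconditions hold until the goal facts are all satisfied."""
    queue = deque([(frozenset(initial), [])])
    seen = {frozenset(initial)}
    while queue:
        state, steps = queue.popleft()
        if goal <= state:
            return steps
        for name, (pre, add, delete) in ops.items():
            if pre <= state:
                nxt = frozenset((state - delete) | add)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + [name]))
    return None

print(plan({"on_table_A", "hand_empty"}, {"on_A_B"}, OPS))
# ['pick_up_A', 'stack_A_on_B']
```

Real planners replace this blind breadth-first search with the specialized algorithms cited above; the state-transition model, however, is the same.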

The need for capability descriptions, which are strongly related to such operator descriptions, in systems that use planning has resulted in a number of interesting capability description languages [WT98], e.g. LARKS, the capability description formalism of Retsina [SLK98], description-logics-based formalisms [BD99], and capability description languages for problem-solving methods in knowledge-based systems (e.g. EXPECT [SGV99]).

Planning for information gathering has received much recent interest because of its role in intelligent information systems for dealing with the information overload of the World Wide Web [Etz96, Mae94]. Since planning for information gathering is a quite special case of planning in general (for instance, information gathering operations do not change the world in the sense that actions in a physical world do), special techniques have been developed for this problem [KW96, AK92, GEW96].

The data integration problem can be formulated as a planning problem as well, with reasoning being done based on the capability descriptions of data sources. Interestingly, this leads to mappings between data sources and global ontologies that are the inverse of the classical method of, for instance, Carnot, or of that conventionally used in federated and multidatabase systems, data warehouses, and mediator systems. In the classical method, "destination" concepts that are part of the "global" integration schemata are described as views over the data sources (conceptually speaking; in practice, these mappings are often encoded as procedural transformation code that does the job). This conventional method of data integration is thus termed global-as-view (GAV) integration.

Data integration by planning, on the other hand, proceeds by having the contents of data sources described as capabilities in terms of the global world model. Queries are answered by building a plan that uses the given data sources, as described in the capability descriptions, to extract and combine their data, and then executing that plan. This kind of data integration, where mappings are expressed as descriptions of "local" sources in terms of the global ontology, is thus called local-as-view (LAV) integration⁷. Notable AI research that follows this route includes the OCCAM planning algorithm [KW96] and the SIMS system [AK92, HK93] for dealing with heterogeneous information sources, which is based on the LOOM knowledge representation and reasoning system [MB87] (using its description logic for expressing the contents of data sources) and the Prodigy planner [CKM91].

⁵ Consider, for instance, partial-order planners [PW92, RN95] and, more recently, SAT planning [KS92] and Graphplan [BF97].

⁶ In STRIPS, operators were described by preconditions and so-called add- and delete-lists of logical statements about the world that are changed by executing an action.

⁷ We will address the GAV and LAV issue in more depth in dedicated sections, 3.5 and 3.6.

3.4.3 Multi-agent Systems

Multi-agent systems (MAS) are, by their very conceptualization, cooperative information systems par excellence. We avoid touching the unsettled issue of trying to define software agents here and refer to [Nwa96, WJ95, Wei99] or the extensive community discussion of that issue in the UMBC agents mailing list archives [Age]. MAS for information integration follow a heavy agent metaphor, in which agents

• have an explicit logical model of their environment and other agents.

• need to reason over their knowledge and over the states of other agents.

• need to plan, both for information gathering (i.e., as a part of the data integration problem analogous to query rewriting) and possibly for multi-agent coordination (e.g. Partial Global Planning [DL91], GPGP [DL92, DL95]).

• communicate in expressive agent communication languages. These usually provide elementary building blocks⁸ for protocols and knowledge exchange formats (e.g. KIF [GF92]).

Furthermore, agents in the information integration setting are usually designed to be cooperative rather than self-interested [SL95].

Apart from being a welcome testbed and melting pot for various areas of AI research, the field has its own interesting and still largely unresolved challenges. The coordination problem in MAS revolves around much more than just providing languages for communication and knowledge exchange. The collaboration of agents requires coordination whose provision is not yet sufficiently understood. Much research has centered around providing coordination algorithms and protocols (e.g. the Contract Net Protocol [Smi80] and (Generalized) Partial Global Planning [DL91, DL92, DL95]), research frameworks (e.g. TÆMS [Dec95]), abstractions of protocols (e.g. conversation policies [SCB+98, GHB99]), social rules [COZ00], pragmatics [HGB99], and game-theoretic considerations [PWC95]. For further interesting work on coordination see [WBLX00, COZ00, Cro94, KJ99].

Another important problem is to establish multi-agent systems as a software engineering paradigm – agent-oriented software engineering [Sho93, Jen99, JW00].

⁸ These building blocks are sometimes called performatives and have at times been motivated by speech act theory [Sea69], as in the case of KQML [FFMM94, FL97].

[Figure 3.4: MAS architectures for the intelligent integration of information. Arrows between agents depict exemplary communication flows; numbers denote logical time stamps of communication flows. The diagram shows user agents, a matchmaker, a mediator agent, data analysis agents, and wrapper agents in front of data sources.]

Intelligent information integration has been a popular application of MAS. Due to their approach of seeking interoperability of several highly autonomous units (the agents), multi-agent systems are almost by definition performing an integration task. Several systems have thus addressed information integration, e.g. Retsina [SLK98], InfoSleuth [NU97, NBN99, BBB+97, FNPB99, NPU98], KRAFT [PHG+99] and BOND [TBM99]. Such systems are particularly interesting for their contributions to structural integration⁹ and have been less groundbreaking with respect to semantic integration, where usually techniques in the tradition of those discussed elsewhere in this chapter are used¹⁰.

A generic MAS architecture for information integration is depicted in Figure 3.4. Such cooperative multi-agent systems are networks of collaborating agents of a number of categories, some of which we list next.

• Wrapper agents connect data sources (possibly legacy systems) to the system by advertising contents to other agents and by listening for and answering data gathering requests of other agents on behalf of the wrapped sources.

⁹ In principle they constitute the promise of the most open, hot-pluggable middleware infrastructure possible.

¹⁰ Surprisingly, systems such as InfoSleuth and KRAFT follow the global-as-view paradigm for integration, as planning is not employed on the level of data integration as is the case in SIMS and OCCAM.

• Middle agents [DSW97, GK94] or facilitators aim at solving the connection problem [DS83], i.e., the problem of enabling providers and requesters in a multi-agent system to initially meet. Middle agents support interoperability and cooperation by matching agents with others that may be helpful in solving their integration problems. Such agents may have varying degrees of "intelligence" and proactivity, and one notably distinguishes between matchmakers, brokers, and mediators.

Matchmakers are advanced yellow pages services with varying degrees of sophistication that allow agents to advertise their services as well as to inquire about the services of other agents (e.g. [SLK98]). Broker agents [RZA95] can be explained as analogous to real-life stock market or real estate brokers. Brokers solve the connection problem by matching agents, but may (and usually do) also act as intermediaries in the subsequent problem solving process. This may, for instance, allow agents communicating via a broker to remain anonymous. Mediators (e.g. [ABD+96]) add additional value by acting as intermediaries between agents collaborating to achieve some common goal and by employing their own capabilities to support the problem solving process. More precisely, mediators do not only attack the connection problem on the level of finding matches, but often also resolve semantic heterogeneity between agents in a heterogeneous system.

Note, however, that there is substantial terminological heterogeneity regarding this issue. In particular, facilitators called brokers have had roles different from the one described above in some systems for information integration [NBN99, PHG+99].

• Data analysis and processing agents provide some value-adding reasoning functionality to the other agents in the system.

• User agents represent the interests of users and gather information from the system on their behalf.

In Figure 3.4, arrows between agents depict two exemplary communication flows, one involving a matchmaker and one involving a mediator agent. The arrows describe the directions of messages sent and are attributed with logical time stamps. The main difference between the two types of middle agents that this figure is meant to clarify is that matchmakers may be consulted for services but requester agents are then left to themselves for the problem solving task, while mediators are usually highly involved throughout this process. The matchmaker of Figure 3.4 checks back whether the agents that it plans to propose to the requester are able to provide the requested service. This goes beyond a simple yellow pages service.


For influential work on multi-agent system architectures following the heavy agent metaphor in general, we refer to ARCHON [CJ96, JCL+96], the Retsina infrastructure [SPVG01], and KAoS [BDBW97]. In conclusion, it is necessary to remark that MAS as cooperative information systems have gone much further than just information integration, for instance to managing and integrating the business processes and supply chains of enterprises [PGR98, JNF98, JFJ+96, JFN+00].

3.5 Global-as-view Integration

The global-as-view way in which mappings between schemata are defined – by describing "global" integrated schemata in terms of the sources¹¹ – has been used in most of the architectures discussed so far. This includes multidatabase systems, the data warehouse architecture, in which one component called the "mediator" performed data integration, and various AI approaches as discussed in Section 3.4.

In this section, we will first discuss the mediator architecture of [Wie92], which has been seminal to information systems research¹². Then, we approach global-as-view integration in a simplistic way, through classical database views (one may expect, however, that this is the approach most often taken in industrial practice). Finally, we briefly discuss some research systems related to this area.

3.5.1 Mediation

Mediators are components of an information system that address a particular heterogeneity problem in the system and provide a pragmatic "solution" to it. A mediator is a "black box" that assumes a number of sources, each with some exported schema (these can be, for instance, wrapped databases or other mediators). A mediator exports some interface (some schema) against which data are integrated. The integration problem is then left to a domain expert, who addresses a certain aspect of heterogeneity and implements the mediator. Each mediator thus encapsulates a particular integration problem. An overview of types of integration problems ("mediation functions") is given in [Wie92]. Such mediation functions include the transformation and subsetting of databases, the merging of multiple heterogeneous databases, the abstraction and generalization of data, and methods for dealing with uncertain data as well as incomplete or mismatched sources.

¹¹ This is also the method implicitly followed by any procedural code that transforms data adhering to one schema into data adhering to another.

¹² Note that the term mediation has experienced substantial overload, and we have used it so far in three different contexts and with four slightly different meanings. Differently from the mediator concept in our data warehouse architecture [JLVV00], this fourth mediator concept has a smaller granularity. Differently from mediator agents in AI, Wiederhold's mediators are far removed from aspects of multi-agent cooperation and are not meant to be "intelligent" [Wie92]. Mediation as the "lazy" approach to data integration, mentioned earlier in the context of data warehousing, closely coincides with this fourth concept.

[Figure 3.5: A mediator architecture. Layers of mediators are stacked above wrappers; each mediator integrates data from wrappers or from other mediators below it.]

A typical architecture of a mediator system is shown in Figure 3.5. For structural integration, data sources are usually wrapped to permit a single way of accessing sources in terms of data models and query languages. Mediators are pieces of code encapsulating some operational knowledge of a domain expert, implementing mediation functions that add value to and remove heterogeneity from the data provided by the sources.
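In code, a mediator in this sense is simply a component that answers requests against its exported schema by combining wrapped sources. The following sketch is hypothetical; the two sources, the `employee` schema, and the salary-filter mediation function are invented for illustration:

```python
# Two wrapped sources, each exporting tuples in a shared representation.
def wrap_staff():      # e.g. a wrapped relational database
    return [{"name": "Ada", "dept": "CS"}, {"name": "Max", "dept": "EE"}]

def wrap_payroll():    # e.g. a wrapped legacy file
    return [{"name": "Ada", "salary": 5000}, {"name": "Max", "salary": 4500}]

def employee_mediator(min_salary=0):
    """Exports one integrated relation employee(name, dept, salary),
    merging the two sources on 'name' and hiding their heterogeneity."""
    salaries = {r["name"]: r["salary"] for r in wrap_payroll()}
    return [
        {"name": r["name"], "dept": r["dept"], "salary": salaries[r["name"]]}
        for r in wrap_staff()
        if salaries.get(r["name"], 0) >= min_salary
    ]

print(employee_mediator(min_salary=4800))
# [{'name': 'Ada', 'dept': 'CS', 'salary': 5000}]
```

The caller sees only the exported `employee` relation; the operational knowledge of how the sources relate is encapsulated inside the mediator, as in the architecture of Figure 3.5.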

3.5.2 Integration by Database Views

Let us assume a relational database context. In the global-as-view approach, global relations are expressed as views (e.g. SQL views) in terms of source relations. Given a global relation p(X̄) and sources p₁, ..., pₙ, p might be expressed as a (finite) set of conjunctive views¹³

p(X̄) ← p₁(X̄₁), ..., pₙ(X̄ₙ).

Given a query posed in terms of view predicates, the query answering process is simple, as it reduces to simple conjunctive query unfolding (see Section 2.1).

Example 3.5.1 Suppose we have four sources of information about books.

acm_proceedings(Title, ISBN)
book1(ISBN, Title, Author, Publisher)
product(Name, Category, Producer, Price)
book_price(ISBN, Price)

We can create a positive database view providing an integrated interface to book information as follows (let us assume we are only interested in the titles, publishers, and prices of books).

book(Title, Publisher, Price) ←
    book1(ISBN, Title, Author, Publisher),
    book_price(ISBN, Price).
book(Title, Publisher, Price) ←
    product(Title, "Book", Publisher, Price).
book(Title, "ACM Press", Price) ←
    acm_proceedings(Title, ISBN),
    book_price(ISBN, Price).

Queries asked over the relation "book" can be answered by unfolding them with the views.

¹³ For the record, such a view is logically equivalent to a declarative constraint of the form

{⟨X̄⟩ | p(X̄)} ⊇ {⟨X̄⟩ | ∃Ȳ : p₁(X̄₁) ∧ ... ∧ pₙ(X̄ₙ)}

in set-theoretic notation, where X̄, Ȳ are tuples of variables, X̄₁, ..., X̄ₙ are tuples of variables and constants, and Ȳ = (X̄₁ ∪ ... ∪ X̄ₙ) − X̄.
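The unfolding of Example 3.5.1 can be simulated directly: each view contributes tuples to `book`, their union is the global relation, and a query over `book` is answered against that union. The source extensions and the price bound in the query below are invented for illustration:

```python
# Toy extensions for the four sources of Example 3.5.1 (invented data).
book1 = [("1111", "Foundations of Databases", "Abiteboul", "Addison-Wesley")]
product = [("Logic Programming", "Book", "MIT Press", 30)]
acm_proceedings = [("PODS 2001", "2222")]
book_price = [("1111", 60), ("2222", 40)]

def book():
    """The GAV relation book(Title, Publisher, Price): each conjunctive
    view contributes tuples, and the answer is their union."""
    out = []
    for isbn, title, _author, publisher in book1:          # view 1
        out += [(title, publisher, p) for i, p in book_price if i == isbn]
    for title, cat, publisher, price in product:           # view 2
        if cat == "Book":
            out.append((title, publisher, price))
    for title, isbn in acm_proceedings:                    # view 3
        out += [(title, "ACM Press", p) for i, p in book_price if i == isbn]
    return out

# Unfolding the query q(T) ← book(T, Pub, Pr), Pr < 50 against the views:
print(sorted(t for t, _pub, pr in book() if pr < 50))
# ['Logic Programming', 'PODS 2001']
```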

3.5.3 Systems

Research systems in this area have usually aimed at providing toolkits and description languages for automating the generation of mediators as far as possible. Three notable research systems in this area – TSIMMIS [GMPQ+97], HERMES [ACPS96], and Garlic [CHS+95] – have been no exception. Since global-as-view integration in its simplest (and relational) form is quite straightforward, research systems have also put emphasis on advanced aspects such as multimedia data integration. In the following, we will take a somewhat closer look at the approach taken in TSIMMIS.

TSIMMIS

TSIMMIS [GMPQ+97] ("The Stanford-IBM Manager of Multiple Information Sources") is a well-known research prototype that provides generators for mediators and wrappers. The generation of mediators and wrappers is a widely proposed technique for leveraging the practical usefulness of the mediator approach. In this system, integration is based on the Object Exchange Model (OEM) [PGMW95] of the Stanford Database Group, a simple semistructured data model. It has also been used in other projects of that group, such as LORE [AQM+97]. The Mediator Specification Language (MSL) uses a syntax similar to datalog, but extended to the semistructured paradigm [ABS00, TMD92, PGMW95].


Mediator definitions are declaratively specified and can then be compiled down to mediators. Of course, such mediator definitions can only be changed (or new mediators added) offline; that is, changes require the definitions to be recompiled.

The semistructured data model and query language used in TSIMMIS also allow for data sources that only supply data for some of the attributes in a mediator interface, which we cannot appropriately match with relational database views in the spirit of Example 3.5.1. For instance, it is possible to define mappings of two sources s₁, s₂ against a mediated relation r by

∀x, y ∃z : s₁(x, y) → r(x, y, z)
∀x, z ∃y : s₂(x, z) → r(x, y, z)

which would not satisfy the range restriction requirement when expressed as a pair of conjunctive logical views. Given the knowledge that the first attribute of r functionally determines the other two (as object identifiers in OEM of course do), the two views could nevertheless be used to answer a query such as q(x) ← r(x, y, z) by compiling the above mappings into a mediator for the view

r(x, y, z) ← s₁(x, y), s₂(x, z).
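One way to read the pair of mappings above is as an outer join on the key attribute: each source contributes the attributes it knows, and attributes no source supplies remain unknown. A sketch with invented data, using `None` for a missing attribute (the compiled conjunctive view above corresponds to the inner-join part of this result):

```python
# s1 supplies (x, y), s2 supplies (x, z); x functionally determines y and z,
# so the mediated relation r(x, y, z) can be assembled by a full outer join.
s1 = {("k1", "a"), ("k2", "b")}
s2 = {("k1", "A"), ("k3", "C")}

def mediated_r():
    ys = dict(s1)
    zs = dict(s2)
    return {
        (x, ys.get(x), zs.get(x))          # None marks an attribute that
        for x in ys.keys() | zs.keys()     # no source supplies
    }

print(sorted(mediated_r()))
# [('k1', 'a', 'A'), ('k2', 'b', None), ('k3', None, 'C')]
```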

More Research Systems

The HERMES system [ACPS96] is another mediator toolkit, one that aims at the wide goal of providing a complete methodology for source integration. The design of the system has taken special care to permit the integration of multimedia sources. The system supports parameterized procedure calls that may be defined for accessing restricted sources and are then used by HERMES mediators to answer queries. The Garlic system [CHS+95] is a research prototype that, similarly to HERMES, aims at integrating multimedia sources. Other systems that clearly fall into the global-as-view category and that we have briefly touched upon earlier are Carnot and multi-agent systems such as KRAFT [PHG+99].

3.6 Local-as-view Integration and the Problem of Answering Queries Using Views

Local-as-view integration (LAV) is strongly related to the database-theoretic problem of answering (rewriting) queries using views [YL87, LMSS95, DGL00, AD98, BLR97, RSU95, PV99, SDJL96, PL00, CDLV00a], which will be discussed in more detail in this section.

Within data integration, the local-as-view approach is applied in global information systems architectures. Influential LAV data integration systems include the Information Manifold [LRO96], InfoMaster [GKD97], and SIMS [AAA+97, AK92, HK93]. Beyond data integration, the problem of answering queries using views has also been found relevant for query optimization [CKPS95] (where previously materialized queries are used to answer similar queries¹⁴), the maintenance of physical data independence [TSI94], and Web-site management systems [FFKL98].

3.6.1 Answering Queries Using Views

The local-as-view approach is based on the notion of a "global" mediated schema, that is, a specially designed integration schema. The content of "local" sources is described by logical views in terms of the predicates of the "global" schema (thus the term local-as-view). Given "global" predicates p₁, ..., pₙ and a source v, a LAV view can be defined as

v(X̄) ← p₁(X̄₁), ..., pₙ(X̄ₙ).

Assuming a query over the global predicates p₁, ..., pₘ, this query can be automatically rewritten by the system to contain only source predicates (such as v) instead of the global predicates.

For the purpose of data integration, we consider only the case where one searches for complete rewritings, which are rewritings in which all global predicates have been replaced by views. We aim at producing rewritings that are minimal. A conjunctive query Q is minimal if there is no conjunctive query Q′ such that Q ≡ Q′ and Q′ has fewer subgoals than Q (see Section 2.4). For the minimality of positive queries (as sets of conjunctive queries), we furthermore require that the conjunctive member queries be pairwise nonredundant, i.e. for a positive query {Q₁, ..., Qₙ} we require Qᵢ ⊈ Qⱼ and Qⱼ ⊈ Qᵢ for each pair of distinct i, j ∈ {1, ..., n}.

One can either attempt to find equivalent rewritings or maximally contained rewritings. Given a conjunctive query Q and a set of conjunctive views V, an equivalent rewriting Q′ – if it exists – is a conjunctive query that only uses the views and which, when expanded with the views, is equivalent to Q. Given a conjunctive query Q and a set of conjunctive views V, Q′ is a maximally contained rewriting¹⁵ (w.r.t. the positive queries¹⁶) if and only if each member query is, when expanded using the views, contained in Q, and there is no conjunctive query Q′′ such that, when expanded with the views, it is contained in Q but not contained in any of the member queries of Q′. In general, it is not always possible to find an equivalent rewriting, and the maximally contained rewriting – as a set of conjunctive queries – may be empty.

Equivalent rewritings require that views be complete, as is usually the case for true materialized database views. In a data integration setting, it is usually appropriate to consider sources to be possibly incomplete.

¹⁴ The problem of answering queries using views is thus indirectly important to global-as-view data integration approaches such as data warehousing as well.

¹⁵ Note that our definition of maximally contained rewritings is different from Levy's [PL00], where a rewriting is only maximally contained if it has the properties we enumerate and there is at least one database for which the result of the original query is strictly larger than the result of the rewriting. Under our definition, however, equivalent rewritings are also maximally contained.

¹⁶ Maximally contained rewritings need to be defined relative to a query language.

Example 3.6.1 [Ull97] Suppose we have a global schema with a virtual predicate p ("parent of"), a query

q(x, y) ← p(x, u), p(u, v), p(v, y).

and two sources s₁ ("grandparent of") and s₂ ("parent of someone who is also a parent"). We can define the following logical views

s₁(x, z) ← p(x, y), p(y, z).
s₂(x, y) ← p(x, y), p(y, z).

Let us first assume that the two views are complete, i.e. that they logically correspond to the constraints

{⟨x, z⟩ | s₁(x, z)} ≡ {⟨x, z⟩ | ∃y : p(x, y) ∧ p(y, z)}
{⟨x, y⟩ | s₂(x, y)} ≡ {⟨x, y⟩ | ∃z : p(x, y) ∧ p(y, z)}

There is an equivalent rewriting of q:

q′(x, z) ← s₂(x, y), s₁(y, z).

Now, if we assume that our views are incomplete sources in a data integration system, they correspond to the logical constraints

{⟨x, z⟩ | s₁(x, z)} ⊆ {⟨x, z⟩ | ∃y : p(x, y) ∧ p(y, z)}
{⟨x, y⟩ | s₂(x, y)} ⊆ {⟨x, y⟩ | ∃z : p(x, y) ∧ p(y, z)}

meaning that s₁ is a source of grandparent relationships and s₂ is a source of parent relationships where the children are themselves parents, but neither source necessarily provides all such relationships (although they provide only such relationships). The implication direction of the conjunctive views shown above is thus somewhat misleading, while the constraints in the set-theoretic notation employed above are exact.

It is possible to show that the following positive query is a maximally contained rewriting (as a set of conjunctive queries) of q that only uses the (incomplete) views s₁ and s₂:

q′(x, z) ← s₁(x, y), s₂(y, z).
q′(x, z) ← s₂(x, y), s₁(y, z).
q′(w, z) ← s₂(w, x), s₂(x, y), s₂(y, z).

Note that this rewriting is also nonredundant and minimal in the sense that we cannot remove any member queries or subgoals and retain a maximally contained rewriting.
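Whether a candidate rewriting is contained in (or equivalent to) the original query can be checked with the classical canonical-database test for conjunctive query containment [CM77]: freeze one query's body into facts and evaluate the other query over them. The following is a minimal sketch; queries are (head, body) pairs, all argument symbols are treated as variables, and the encoding of Example 3.6.1 is our own:

```python
from itertools import product as cartesian

def evaluate(query, db):
    """All answers to a conjunctive query (head, body) over db,
    where db maps predicate names to sets of fact tuples."""
    head, body = query
    atoms = list(body)
    answers = set()
    for choice in cartesian(*[sorted(db.get(p, ())) for p, _ in atoms]):
        subst, ok = {}, True
        for (_, args), fact in zip(atoms, choice):
            for a, v in zip(args, fact):
                if subst.setdefault(a, v) != v:   # inconsistent binding
                    ok = False
                    break
            if not ok:
                break
        if ok:
            answers.add(tuple(subst[v] for v in head))
    return answers

def contained_in(q1, q2):
    """q1 ⊆ q2 iff q2, evaluated on the 'frozen' body of q1 (variables
    read as constants), returns q1's frozen head tuple [CM77]."""
    head1, body1 = q1
    canonical = {}
    for pred, args in body1:
        canonical.setdefault(pred, set()).add(tuple(args))
    return tuple(head1) in evaluate(q2, canonical)

# Expansion of q'(x,z) ← s2(x,y), s1(y,z) with the views of Example 3.6.1:
expansion = (("x", "z"),
             [("p", ("x", "y")), ("p", ("y", "u1")),       # s2(x,y) expanded
              ("p", ("y", "u2")), ("p", ("u2", "z"))])     # s1(y,z) expanded
q = (("x", "z"), [("p", ("x", "a")), ("p", ("a", "b")), ("p", ("b", "z"))])

print(contained_in(expansion, q))  # True: the expansion is contained in q
```

Under the completeness assumption, containment holds in both directions here, which is exactly why q′(x, z) ← s₂(x, y), s₁(y, z) is an equivalent rewriting in the example.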

**It can be shown that if both q and the views in V are conjunctive queries
**

(CQs) without arithmetic comparison predicates, then it is sufficient to consider

only rewritings with at most as many subgoals (views) as the original query

[LMSS95] as candidates for both equivalent and maximally contained rewritings.

(See also [Ull97].) A naive algorithm for finding an equivalent rewriting is thus

to guess an arbitrary rewriting Q′ of Q with at most as many subgoals as in Q

which uses only the views in V and then to check if Q′ is equivalent to Q. For

maximally contained positive rewritings, one can incrementally build a maximal

set of rewritings by searching the whole space of such rewritings (which is finite).

The problem of answering queries using logical views is NP-complete already

in the simple case of conjunctive queries without arithmetic comparison predicates

[CM77, LMSS95]. Thus this is a presumably hard reasoning problem. However,

it spares the human designer from having to carry out the rewriting task by

hand.17 For more expressive classes of query languages, the problem is harder or

undecidable [vdM92, SDJL96, CV92, Shm87].
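The equivalence check inside the naive algorithm reduces to conjunctive-query containment, which is decidable via containment mappings [CM77]. The sketch below is illustrative only: queries are encoded as a head-variable tuple plus a subgoal list (an encoding invented here), and constants are not handled.

```python
# Containment check for conjunctive queries by backtracking search for a
# containment mapping: q1 contains q2 iff there is a homomorphism from
# q1's body into q2's body that maps q1's head onto q2's head [CM77].

def contains(q1, q2):
    """q = (head_vars, [(pred, args), ...]); all terms are variables."""
    head1, body1 = q1
    head2, body2 = q2

    def extend(mapping, subgoals):
        if not subgoals:
            # every head variable of q1 must land on the matching
            # head position of q2
            return all(mapping.get(v) == w for v, w in zip(head1, head2))
        (pred, args), rest = subgoals[0], subgoals[1:]
        for p2, args2 in body2:
            if p2 != pred or len(args2) != len(args):
                continue
            m = dict(mapping)
            ok = True
            for a, b in zip(args, args2):
                if m.setdefault(a, b) != b:   # conflicting binding
                    ok = False
                    break
            if ok and extend(m, rest):
                return True
        return False

    return extend({}, list(body1))

# q(x,z) <- p(x,y), p(y,z) contains its specialization q2(x,x) <- p(x,x)
q  = (('x', 'z'), [('p', ('x', 'y')), ('p', ('y', 'z'))])
q2 = (('x', 'x'), [('p', ('x', 'x'))])
print(contains(q, q2), contains(q2, q))   # True False
```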

3.6.2 Algorithms

Several improvements over the naive query rewriting algorithm have been pro-

posed, among them the Bucket algorithm of the Information Manifold [LRO96],

the Inverse Rules algorithm [DG97] of the InfoMaster System [GKD97], the

MiniCon algorithm [PL00], OCCAM [KW96], and the Unification-join algorithm

[Qia96]. We will discuss three of these algorithms in more detail.

The Bucket algorithm uses the following simple optimization over the naive

algorithm. For each of the subgoals of a given query, each view is indepen-

dently checked for whether it is possibly relevant to replacing that subgoal.

Such candidate views are collected in “buckets”, one for each subgoal. Exhaus-

tive search is then carried out in the Cartesian product of the buckets. Thus the

necessary search space required for combining the views in the buckets is pruned

compared to the naive algorithm.
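A minimal sketch of the bucket-construction step (simplified: here a view qualifies for a bucket if its body merely mentions the subgoal's predicate, whereas real implementations also test whether the distinguished variables can be matched; the relation encoding is invented):

```python
from itertools import product

# Query and views of Example 3.6.1, encoded as (predicate, args) subgoals.
query_body = [('p', ('x', 'u')), ('p', ('u', 'v')), ('p', ('v', 'y'))]
views = {
    's1': [('p', ('x', 'y')), ('p', ('y', 'z'))],
    's2': [('p', ('x', 'y')), ('p', ('y', 'z'))],
}

# One bucket per query subgoal: the views possibly relevant to it.
buckets = []
for pred, _ in query_body:
    buckets.append([v for v, body in views.items()
                    if any(p == pred for p, _ in body)])

# Each element of the Cartesian product is one candidate combination
# that still has to be checked for containment in the query.
candidates = list(product(*buckets))
print(len(candidates))   # 8 = 2 * 2 * 2
```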

The inverse rules algorithm first transforms the views into Horn clauses. The

queries can then be answered by executing the combination of the query and the

Horn clauses representing the views as a logic program, in a bottom-up fashion.

17 In global-as-view integration, on the other hand, mediators have to be specially designed in order to be able to answer a certain repertoire of queries.


Example 3.6.2 In the inverse rules algorithm, the views of Example 3.6.1 (under

the incomplete views semantics) correspond to the Horn Clauses

p(x, f1(x, z)) ← s1(x, z).  p(f1(x, z), z) ← s1(x, z).

p(x, y) ← s2(x, y).  p(y, f2(x, y)) ← s2(x, y).

Given instances s1 = {⟨a, c⟩, ⟨b, d⟩} and s2 = {⟨a, b⟩, ⟨b, c⟩}, we can derive

p(a, f1(a, c)), p(f1(a, c), c), p(b, f1(b, d)), p(f1(b, d), d),

p(a, b), p(b, f2(a, b)), p(b, c), p(c, f2(b, c))

and finally q(a, d) as the answer to the query of Example 3.6.1.
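This bottom-up derivation can be replayed in a few lines of Python (an illustrative sketch; Skolem terms f1 and f2 are represented as tagged tuples, and answers containing Skolem terms are discarded as uncertain):

```python
# Bottom-up evaluation of the inverse-rules program of Example 3.6.2.
# A Skolem term f1(x, z) is represented as the tuple ('f1', x, z).

s1 = {('a', 'c'), ('b', 'd')}   # incomplete "grandparent" source
s2 = {('a', 'b'), ('b', 'c')}   # incomplete "parent of a parent" source

# Apply the four inverse rules once (the program is non-recursive).
p = set()
for x, z in s1:
    f = ('f1', x, z)            # the unknown middle person
    p.add((x, f))
    p.add((f, z))
for x, y in s2:
    p.add((x, y))
    p.add((y, ('f2', x, y)))    # the unknown child of y

# Evaluate q(x, y) <- p(x, u), p(u, v), p(v, y), keeping only answers
# free of Skolem terms.
q = {(x, y)
     for (x, u) in p
     for (u2, v) in p if u2 == u
     for (v2, y) in p if v2 == v
     if not isinstance(x, tuple) and not isinstance(y, tuple)}

print(q)   # {('a', 'd')}
```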

Such a logic program can be transformed into an equivalent (function-free)

nonrecursive datalog program, which can be unfolded into a set of conjunctive

queries using a simple transformation [DG97].

The MiniCon algorithm uses information about variables occurring in queries

for finding maximally contained rewritings. The MiniCon algorithm is based on

the notion of MiniCon descriptions (MCDs).18

Definition 3.6.3 Given a conjunctive query Q and a set of views V, an MCD

m is a tuple ⟨Vkm, hm, Gm, φm⟩ of

• A view Vkm ∈ V.

• A head homomorphism19 hm on the view Vkm.

• A set Gm ⊆ Body(Q) of subgoals of Q.

• A function φm : Vars(Gm) → Vars(Vkm) that maps the variables in the
subgoals Gm of Q into the variables of Vkm.

that satisfies the following properties.

• For each g ∈ Gm, there is a subgoal of our view such that φm(g) ∈
Body(hm(Vkm)). (Gm is not necessarily the largest such set of subgoals
of Q.)

18 Informally speaking, an MCD represents a fragment of a containment mapping from the query to its rewriting which encompasses only the application of a single view and which is, in a sense, atomic.

19 A head homomorphism h : Vars(V) → Vars(V) is a mapping of variables that is the identity (h(v) = v) on variables not in the head of the view and maps head variables to head variables; more exactly, a head variable v ∈ Vars(Head(V)) is either mapped to itself (h(v) = v) or to another head variable for which h is the identity, i.e. h(v) = w, w ∈ Vars(Head(V)), h(w) = w.


q(x, y) ← p(x, u), p(u, v), p(v, y).

m1: s1(x, z) ← p(x, y), p(y, z).   covers g1, g2   φ1(x) = x, φ1(u) = y, φ1(v) = z
m2: s1(x, z) ← p(x, y), p(y, z).   covers g2, g3   φ2(u) = x, φ2(v) = y, φ2(y) = z
m3: s2(x, y) ← p(x, y), p(y, z).   covers g1       φ3(x) = x, φ3(u) = y
m4: s2(x, y) ← p(x, y), p(y, z).   covers g2       φ4(u) = x, φ4(v) = y
m5: s2(x, y) ← p(x, y), p(y, z).   covers g3       φ5(v) = x, φ5(y) = y

Figure 3.6: MiniCon descriptions of the query and views of Example 3.6.1.

• For each variable v ∈ Vars(Head(Q)), φm(v) ∈ Vars(Head(hm(Vkm))).

• For each variable v ∈ Vars(Q) for which

φm(v) ∈ Vars(hm(Vkm)) − Vars(Head(hm(Vkm)))

(i.e., φm(v) is among the existentially quantified variables20 of the head
homomorphism on the view), all other subgoals in Q that contain v are in
Gm.

• m is minimal in the sense that there is no proper subset of Gm s.t. the
previous property remains true.

• hm is the least restrictive head homomorphism necessary in order to allow
the view and query subgoals to be unified.

This is best explained with an example.

Example 3.6.4 For the query and the views of Example 3.6.1, there are five
MCDs (see Figure 3.6). Note that for all MCDs and variables, their head homo-
morphism is the identity (id(v) = v for all variables in the respective view), so
we do not explicitly state it. For brevity, let g1, g2, g3 denote the three subgoals
of Q.

m1 = ⟨s1, id, G1 = {g1, g2}, φ1⟩ with φ1(x) = x, φ1(u) = y, φ1(v) = z.

20 In the data integration setting, these are thus the attributes that were projected out in the materialized views. Data for them are not available, and the variables bound to these attributes not only cannot be bound to head variables of the query but also must not occur in any subgoals of Q left to be covered by other MCDs to produce a rewriting: this would require a join of two source views on attributes that are “not available”.


m2 = ⟨s1, id, G2 = {g2, g3}, φ2⟩ with φ2(u) = x, φ2(v) = y, φ2(y) = z.

m3 = ⟨s2, id, G3 = {g1}, φ3⟩ with φ3(x) = x, φ3(u) = y.

m4 = ⟨s2, id, G4 = {g2}, φ4⟩ with φ4(u) = x, φ4(v) = y.

m5 = ⟨s2, id, G5 = {g3}, φ5⟩ with φ5(v) = x, φ5(y) = y.

Given the set M of all MiniCon descriptions for a query Q and a set of views
V, all conjunctive queries that have to be considered for a maximally contained
positive rewriting of Q can be constructed from combinations m1, . . . , mk of ele-
ments of M for which the sets Gm1, . . . , Gmk form a disjoint partition of the set of
all subgoals in Q, i.e. Gm1 ∪ . . . ∪ Gmk = Body(Q) and Gmi ∩ Gmj = ∅ for each
pair i ≠ j in 1 . . . k. Note that one also does not have to compute any containment
mappings as needed in the Bucket algorithm anymore, as this is already implicit
in the combination of the hi and φi.

Example 3.6.5 Let M be the set of five MCDs of the previous example. There
are three partitions of {g1, g2, g3} using G1, . . . , G5, namely {G1, G5}, {G3, G2},
and {G3, G4, G5}. The rewritings producible from these partitions are those of
Example 3.6.1.
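The partition search behind this example can be sketched as a brute-force enumeration (illustrative only; an actual MiniCon implementation combines MCDs more efficiently):

```python
from itertools import combinations

# Subgoal sets covered by the five MCDs of Example 3.6.4
# (g1, g2, g3 are encoded as 1, 2, 3).
G = {'G1': {1, 2}, 'G2': {2, 3}, 'G3': {1}, 'G4': {2}, 'G5': {3}}
body = {1, 2, 3}  # Body(Q)

partitions = []
for r in range(1, len(G) + 1):
    for combo in combinations(sorted(G), r):
        sets = [G[n] for n in combo]
        union = set().union(*sets)
        # disjoint iff no subgoal is covered twice
        if sum(len(s) for s in sets) == len(union) and union == body:
            partitions.append(set(combo))

print(partitions)  # the three partitions {G1,G5}, {G2,G3}, {G3,G4,G5}
```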

We arrive at the maximally contained rewriting of Q by transforming each
of the partitions in the following way. Let {m1, . . . , mn} be such a partition.
We apply φi⁻¹(hi(Vki)) for each MCD mi and combine the transposed views by
conjunction into conjunctive queries. For those variables of a view for which φi⁻¹
is undefined, i.e. variables that only appear in subgoals of the view that are not
matched with any of the subgoals of the query in the MCD, new variable names
need to be invented.

Note that none of the three algorithms that we have discussed directly pro-

duces rewritings that are guaranteed to be minimal, so results have to be sepa-

rately optimized to obtain this property.

The Inverse Rules algorithm in its original formulation produces a datalog

rewriting, and rewritten views are kept separate from queries. In the case of

the rewriting of conjunctive queries, the rewriting process thus defers part of the

activity carried out by the other two algorithms to the time of query execution.

To compare this algorithm with the others, it is thus necessary to unfold the dat-

alog program produced by the Inverse Rules algorithm using the transformation

of [DL97b] (also discussed in Section 7.2) or include query execution into the

performance consideration.

Given moderately sophisticated techniques for executing datalog queries, the

Inverse Rules algorithm performs better than the brute-force bucket algorithm.

The MiniCon algorithm, which takes into account more problem-specific knowl-

edge and thus reduces the amount of redundant computations, however, in prac-

tice outperforms even the Inverse Rules algorithm in the altered form that unfolds

the rewritings into sets of conjunctive queries [PL00].


3.6.3 Bibliographic Notes

The theory of answering queries using views is surveyed in [Hal00] and [Lev00].

It is strongly related to the query containment problem, and is usually at least as

hard. The exception is the problem of answering datalog queries using conjunctive

views, which is efficiently solvable [DG97], while the related containment problem

is undecidable [Shm87]. On the other hand, the solution proposed in [DG97] does

not apply query rewriting in the strong sense.21

The query rewriting problem in the presence of arithmetic comparison predi-

cates in the query and views has been addressed in [LMSS95] for the case of equiv-

alent rewritings. For the case of maximally contained rewritings, it is known that

no complete algorithm can exist, not even one that produces a recursive rewrit-

ing [AD98]. A sound algorithm that covers many practically important cases,

however, is presented in [PL00].

Queries with aggregation are addressed in [SDJL96]. The problem of an-

swering queries using views in object-oriented databases and OQL [CBB+ 97] has

been addressed in [FVR96]. The same problem in the case of regular path queries

in semistructured data models is discussed in [CDLV99, CDLV00b, CDLV00a].

The problem of answering queries using views with functional dependencies (over

the global predicates) has been addressed in [LMSS95] for the case of equivalent

rewritings, where the bound on the maximal number of subgoals only needs to be

slightly extended to the sum of the number of the subgoals in the original query

plus the sum of the arities of the subgoals. Maximally contained rewritings for

the same case may need to be recursive [DL97b, DGL00].

Binding Patterns (Adornments)

The problem of answering queries using views with binding patterns derives its rel-

evance from the fact that many sources in data integration systems have restricted

query interfaces. This is the case for legacy systems as well as for screen-scraping

Web interfaces where certain chunks of information may need to be provided s.t.

queries can be executed (e.g. book titles in online book stores). These restrictions

can be conveniently modeled using binding patterns.22

A binding pattern is an annotation that states, for each argument position of the predi-

cate, whether it is bound or free. At query execution time, variables in argument

positions marked “bound” have to be bound to constants before the extent of the

predicate is accessed (i.e., the source is queried).

Example 3.6.6 Consider the query q^{b,f}(x, z) ← p(x, y), p(y, z). which requires

(and guarantees) that the variable x will be bound to a constant when executed.

21 We will apply this technique for rewriting recursive queries in Section 7.2.

22 Binding patterns or adornments have been used elsewhere, for instance in the theory of optimizing recursive queries [Ull89].


Furthermore, we have a view v^{b,f}(x, y) ← p(x, y). for a source that can only
answer queries when provided “input” in its first attribute position. The query
can be rewritten into q^{b,f}(x, z) ← v(x, y), v(y, z). However, the query q^{f,b}(x, y) ←
p(x, y). cannot be rewritten because the only available source does not allow
access to p tuples without providing input for the first attribute position.

Binding patterns allow the integration of data transformation functions into

the rewriting process, where input arguments of such functions are modeled as

“bound” and output arguments as “free”.
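Whether a candidate rewriting is executable under given binding patterns can be tested greedily: repeatedly pick a subgoal all of whose “bound” positions already hold bound variables. A minimal sketch (the string encoding of patterns such as 'bf' is an assumption of this illustration):

```python
# Feasibility check: can the subgoals of a rewriting be ordered so that
# every "b" position is bound when its source is called?

def feasible(subgoals, initially_bound):
    """subgoals: list of (pattern, vars), e.g. ('bf', ('x', 'y'))."""
    bound, remaining = set(initially_bound), list(subgoals)
    while remaining:
        for g in remaining:
            pattern, args = g
            if all(v in bound for v, a in zip(args, pattern) if a == 'b'):
                bound.update(args)   # the call binds all its variables
                remaining.remove(g)
                break
        else:
            return False             # no subgoal is callable: stuck
    return True

# q^{b,f}(x,z) <- v(x,y), v(y,z) with v^{b,f}: executable given x
print(feasible([('bf', ('x', 'y')), ('bf', ('y', 'z'))], {'x'}))  # True
# with only z bound initially, neither call can be made
print(feasible([('bf', ('x', 'y')), ('bf', ('y', 'z'))], {'z'}))  # False
```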

For the problem of computing equivalent rewritings given sources with binding

patterns, the search space is larger than in the case of the problem of answering

queries using views without binding patterns [RSU95], but the problem remains

NP-complete. Maximally contained rewritings may not be expressible as finite

sets of conjunctive queries, but can be encoded as recursive datalog programs

[KW96, DL97b, DGL00].

Algorithms and results bounding the search for equivalent rewritings have

been presented in [RSU95]. Earlier, queries with “foreign functions” were consid-

ered in the context of query optimization in [CS93]. The Information Manifold

[LRO96], a system for integrating Web sources, supports source descriptions with

binding patterns that permit the specification of input and output attributes.

They are meant to facilitate the integration of sources that do not have full rela-

tional query capabilities, such as legacy sources or screen-scraping Web interfaces.

Answering Queries using Views under the Closed World Assumption

Note that we have so far discussed the problem of answering queries using views

in the light of an open-world assumption, which is appropriate in the context

of data integration and the assumption that sources may provide incomplete

information. It is also possible to approach the problem under a closed-world

semantics, centered around the notion of certain answers [AD98].

Example 3.6.7 Consider a query q(x, y) ← p(x, y). and sources v1(x) ← p(x, y).

and v2(y) ← p(x, y). Under the open-world assumption, this query cannot be

answered. Let the extents of v1 and v2 now be v1 = {⟨a⟩} and v2 = {⟨b⟩}. Under

the closed-world assumption, we have the certain answer ⟨a, b⟩ to the query,

because the projections of the tuples in the extent of p are complete and entail

that certain answer.
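For an example this small, the certain answers can be computed by brute force: enumerate every instance of p over the active domain whose projections equal the complete view extents, and intersect the resulting query answers. The restriction to the active domain {a, b} is a simplification that suffices here.

```python
from itertools import chain, combinations

domain = {'a', 'b'}
tuples = [(x, y) for x in domain for y in domain]
v1, v2 = {'a'}, {'b'}        # complete extents of v1(x) and v2(y)

def powerset(s):
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

# Collect the answers of q(x, y) <- p(x, y) over every consistent p.
answer_sets = []
for p in map(set, powerset(tuples)):
    if {x for x, _ in p} == v1 and {y for _, y in p} == v2:
        answer_sets.append(p)    # the answers of q are p itself

certain = set.intersection(*answer_sets)
print(certain)   # {('a', 'b')}
```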

This problem and its complexity are discussed in [AD98, MLF00]. Note that

the problem of answering queries using views under the closed-world assumption

has the practical disadvantage that reasoning can only be done relative to the

data rather than the query, thus leading to a scalability problem.


3.7 Description Logics-based Information Integration

3.7.1 Description Logics

Description logics23 (DL), also known as terminological logics or concept lan-

guages, are structured logical languages that are based on a well-designed tradeoff

between expressive power and complexity. The main goal is the design of lan-

guages that allow to conveniently express a large number of practical problems

related to concepts and objects while still remaining decidable24 . They can be

motivated by semantic networks, frame languages, terminological reasoning, and

semantic and object-oriented data models [RN95].

Description logics are usually constructed from unary relational predicates

(called concepts or concept classes) and binary relations (roles or attributes).

Instances of concepts are usually called individuals. Description logics are de-

fined by a fixed set of logical constructors, such as concept intersection C1 ⊓ C2 ,

union C1 ⊔ C2 and negation ¬C, all-quantification of roles with qualification ∀R.C

(denoting the concept {⟨x⟩ | ∀y : R(x, y) → C(y)}), existential quantification,

which may (∃R.C, denoting {⟨x⟩ | ∃y : R(x, y) ∧ C(y)}) or may not (∃R) sup-

port qualification, the conjunction and union of roles, the concatenation of roles

R1 ◦ R2 , number restrictions on roles ((≤ nR) and (≥ nR), where n is a constant

integer), and others.25 More complex concepts and roles are defined inductively

from atomic concepts and roles using the provided constructors.

Constraints are of the form C1 ⊑ C2 or C1 ≡ C2, where C1 and C2 are concepts.

Constraints are subsumption (logical “containment” of the extents of the expres-

sions) relationships between concepts. For instance, the subsumption relationship

C1 ⊑ C2 expresses the logical constraint ∀x : C1 (x) → C2 (x).

The semantics of these languages is the straightforward classical logical one,

applied to their special syntactic peculiarities. The syntax of

23 We restrict the presentation of description logics to a short overview. For a more detailed introduction to this area see e.g. [DLNS96] or [Fra99].

24 However, the ancestor of description logics systems, KL-ONE [BS85], was found not to have this property [SS89]. The culprit was the same-as constructor, which allows concepts of the form

∀y1, y2 : (R1(x, y1) ∧ R2(x, y2)) → y1 = y2

to be expressed. This constructor makes description logics lose the tree-model property [Var97] and their correspondence with modal logics, and causes even the simplest and most restricted description logics to become undecidable (see e.g. [DLNS96]). This problem was fixed in the successor system CLASSIC [BPS94, BBMR89b] by a slight change of the semantics of extents (a “hack”). Note also that the LOOM system [MB87], which is often listed among description logics systems, provides an incomplete reasoning service over a very expressive logical language.

25 These constructors are motivated by the ALC family of languages [SSS91, DLNS96]. See [PSS93] for a standardization effort.


most description logics languages differs from the classical syntax of first-order

logics because constraints in such concept languages usually can be expressed in

a variable-free form.

The main reasoning problems in description logics systems are subsumption

and classification. Subsumption is the logical implication problem in description

logics languages on the level of concepts. Given a set of constraints Σ in a DL

language, subsumption is the problem of deciding whether Σ implies the truth

of the logical formula corresponding to an additional constraint C1 ⊑ C2 . In

other words, this is the problem of deciding whether Σ implies that concept C1

is contained in C2 . The classification problem is to decide whether a certain

individual belongs to a given concept class.

3.7.2 Description Logics as a Database Paradigm

Description logics systems have been discussed as database systems before, e.g.

in the context of CLASSIC [BBMR89a, Bor95] and DLR [CDL98a, CDL+ 98b,

CDL99]. Description logics are relevant to data integration in two ways. Given

that queries are expressed as concepts and constraints express inter-schema rela-

tionships such as views,

• concept subsumption can be used to decide query containment under con-

straints and

• the classification of individuals (the objects of a database or a set of het-

erogeneous databases) can be used for answering queries in heterogeneous

databases.

Apart from that, description logics have been used to verify the consistency

of schemata [FN00]. Let us consider description logics subsumption and classi-

fication as a way of performing data integration.

Example 3.7.1 Consider the following set of three constraints.

GrandparentOrNoParent ≡ Person ⊓ ∀child.(∃child.⊤)

ParentOfFerrariDriver ≡ Person ⊓ ∃child.(∃drives.Ferrari)

Ferrari ⊑ ItalianCar

Given our data integration setting, let GrandparentOrNoParent and Par-

entOfFerrariDriver be database relations with an extent (data sources). “child”

and “drives” are roles. The first constraint describes individuals of the class

GrandparentOrNoParent as persons whose children, if they have children, are

parents themselves. The name of the second source speaks for itself. In the third

constraint, we define Ferraris as Italian cars (that is, the concept class Ferrari is

a subclass of the class of Italian cars). Now let us ask a query for all persons who

have children that drive Ferraris and have children themselves.


Person ⊓ ∃child.((∃child.⊤) ⊓ (∃drives.Ferrari))

This query can be answered by attempting to classify all the individuals known

to the system. The answer will be the set of individuals that belong to both the

classes GrandparentOrNoParent and ParentOfFerrariDriver. It is also derivable

that our constraints imply

Person ⊓ ∃child.((∃child.⊤) ⊓ (∃drives.Ferrari)) ≡

GrandparentOrNoParent ⊓ ParentOfFerrariDriver

Also, we can determine that our set of constraints implies the subsumption

Person ⊓ ∃child.((∃child.⊤) ⊓ (∃drives.ItalianCar)) ⊒

GrandparentOrNoParent ⊓ ParentOfFerrariDriver.

but not equivalence.
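The classification step of this example can be mimicked over one concrete interpretation (this is merely model checking over known data, not the open-world reasoning a description logics system performs; all individuals and extents below are invented):

```python
# Tiny concrete interpretation: concepts as sets of individuals,
# roles as sets of edges.
person  = {'ann', 'bob', 'carl'}
ferrari = {'car1'}
child   = {('ann', 'bob'), ('bob', 'carl')}
drives  = {('bob', 'car1')}
top     = person | ferrari          # all known individuals (⊤)

def exists(role, concept):
    """∃role.concept, evaluated as a set of individuals."""
    return {x for x, y in role if y in concept}

# Person ⊓ ∃child.((∃child.⊤) ⊓ (∃drives.Ferrari))
answer = person & exists(child, exists(child, top) & exists(drives, ferrari))
print(answer)  # {'ann'}
```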

Note that the constraints of the previous example clearly follow a local-as-

view pattern.26 In general, however, constraints in description logics are truly

symmetric (in constraints of the form C1 ⊑ C2 or C1 ≡ C2, both C1 and C2 may be

complex composed concept definitions representing queries), making it possible to

combine global-as-view and local-as-view integration.

Recently, two kinds of extensions to the ALC-style languages (for which decid-

ability is of course preserved) have been proposed. Firstly, there has been work

on defining concepts using fixpoints for e.g. transitive roles (µALCQ [DL97a],

[HM00]) and that allow to express general regular path expressions, as they are

important in the context of queries over semistructured databases (for instance,

see the expressive description logic DLR [CDL98a, CDL+ 98b, CDL99]). Sec-

ondly, description logics (e.g., again, DLR) have dropped the requirement that

roles be binary relations. Instead, arbitrary relations may be used but have to

be projected down to binary before being used in constraints.

The restrictions and drawbacks of description logics for data integration are

as follows.

• Description logics provide two kinds of reasoning: query answering [CDL99]

by the classification of data and the verification of query containment by

subsumption. They do not lend themselves to query rewriting, however.

While it is possible to check, given a rewriting, if it is contained in the

input query, there is in general no way of finding such a rewriting given only

the input query. Query answering, however, is impractical, as it requires

all the data available in the system to be imported into the description

26 Note that description logics-based data integration is sometimes considered a case of local-as-view integration. We kept the discussion separate to leave the work on the problem of answering queries using views to its own section.


logics system, where each data object has to be independently classified for

membership in the concept class described by the query. This does not scale

to large databases and may not be feasible because data sources may have

restricted (e.g. screen-scraping) interfaces or be legacy systems, rendering

it impossible to extract “all” their data.

• Query languages are restricted to tree-style queries without any circulari-

ties. (Consider our earlier comment on same-as constraints and the entailed

undecidability.) For instance, this excludes simple queries such as

q(x) ← parent(x, y), employer(x, y).

3.7.3 Hybrid Reasoning Systems

For efficiency reasons, recent description logics systems (e.g. KRIS [BH91],

BACK [vLNPS87, NvL88], KRYPTON [BPGL85] and FaCT [Hor98]) have sepa-

rated the reasoning with concepts (TBox reasoning) from the reasoning with indi-

viduals (ABox reasoning [HM00]), using different techniques for the two problems

and creating hybrid reasoning systems [Neb89].

Hybrid knowledge representation systems have also been built by combining

description logics reasoning with deductive databases and nonmonotonic rea-

soning [DLNS98, Ros99] or local-as-view integration using database techniques

[LR96]. The Information Manifold [LRO96, BLR97], a local-as-view system with

query rewriting based on the Bucket algorithm, uses the description logic CARIN

[LR96] to constrain concepts used in source descriptions (views)27 .

3.8 The Model Management Approach

The vision of the model management approach is to represent schemata and inter-

schema mappings as first-class objects in a repository28 [BLP00]. This approach

allows powerful operations to be defined on schemata and mappings, such as the un-

folding (concatenation) of mappings and the application of mappings to schemata

in order to transform them.

Model management permits the computer-aided manipulation of such meta-

data using easy-to-use graphical user interfaces, as demonstrated by both research

systems (e.g. Clio [MHH+ 01], ONION [MKW00]) and commercial systems such

as Microsoft Repository [BB99]. The OBSERVER system [MIKS00] manages

several heterogeneous ontologies and mappings between them in a repository and

may be considered another exponent of the model management approach.

27 This is an alternative role of description logics systems in data integration.

28 This relates to interesting research on logical languages for reasoning about schemata (e.g. F-Logic [KL89], HiLog [CKW89], and Telos/ConceptBase [JGJ+ 95]) and meta-data query languages [LSS99, RVW99].


Schema matching techniques [MHH+ 01, MZ98, MKW00] have been used in

such systems for defining mappings between schemata. Most work in this area is

based on the definition of correspondences between schema objects (e.g. classes,

attributes, or relationships), often graphically, by drawing lines between them

[MKW00, MZ98, BLP00, MHH+ 01]. The formalisms for defining mappings have

often been quite restrictive, and agreed-upon semantics have not yet emerged.

Systems such as Clio [MHH+ 01] propose several alternative semantics for such

correspondences for users to choose among.

Schema matching has also been used for XML data transformation [MZ98].

For data integration, these approaches have the drawback that the integration

problem is solved by processing the data rather than transforming the queries,

thus leading to a scalability problem.

3.9 Discussion of Approaches

Quality Factors of Data Integration Architectures

In this chapter, we have encountered a number of data integration architectures.

Given the integration problem motivated in Chapter 1, some of the main questions

regarding the quality of data integration architectures are

• Does the approach apply query rewriting or query answering? This is im-

portant because if the output of the data integration process is a query

which can be independently optimized and reformulated, performance im-

provements are possible that otherwise would not be attainable. The sepa-

ration is also important because in some approaches, integration by data

transformation is computationally much harder than just executing queries

arriving at the same results, if such queries exist and can be computed. Fi-

nally, such a separation allows to select the best implementations for both

problems – core data integration and query evaluation – independently.

• Does the approach use a global schema against which all sources are in-

tegrated, or may there be several different schemata against which data

integration is carried out? The first approach may be preferable from a

standpoint of managing mappings. If there is only a single integration

schema, fewer mappings may be needed than if there are many. Note that

given m integration schemata and n sources, on the order of m · n mappings

may be needed to integrate them. (That makes m² in a federated database

system.) Clearly, a global integration schema (m = 1) is usually preferable

over a quadratically growing number of mappings.

However, the integration problem may require support for multiple au-

tonomous integration schemata, which may evolve independently. Change

of requirements may lead to the evolution of schemata against which data


                        Global-as-view/procedures         Local-as-view

Management of change    Problematic: change of a single   Good
to sources              source may require the redesign
                        of (many) mediators/procedures

Management of change    Problematic: coupling of          Problematic: change of the
of requirements         mediator interfaces               global schema requires a
                                                          global redesign of views

Figure 3.7: Comparison of global-as-view and local-as-view integration.

are integrated. If there are many mostly independent integration problems,

it may be preferable to avoid the creation of a single global schema. If there

are several smaller schemata and only one of them needs to be changed, one

can expect that fewer mappings will be affected.

• How stable and reusable are mappings when change occurs? Given a large

information infrastructure that needs to be managed, one does not want

changes to propagate through the system further than absolutely neces-

sary, invalidating other components that then need to be changed as well.

Subsystems should be largely decoupled, making changes manageable. Al-

ternatively, if changes do need to occur, it should be possible to automate

them as far as possible.

There are two kinds of changes that we want to differentiate between, the

change of sources and the change of integration requirements (or the evo-

lution of an integration schema or “global” schema).

• How well does the approach support the mapping of sources and integration

schemata that show serious concept mismatch? As we will show later in

this section, procedural approaches as well as simple view-based approaches

have their restrictions with respect to this issue, which are more severe

than it may appear at first sight. Declarative approaches with symmetric

constraints are the most desirable and complete ones.

Global-as-view versus Local-as-view Integration

Let us first compare local-as-view and global-as-view integration. A major ad-

vantage of local-as-view mappings is their maintainability (Figure 3.7). When

sources are added or removed, change remains local to those logical views that

define these sources. GAV mediators may require a major global redesign when

sources change, which may propagate through many mediators. Once a global

integration schema has been defined for LAV, this schema allows good decou-

pling between sources and the global information system, which is essential if


                       Query       Global    “Declarative”   Symmetric
                       rewriting?  schema?   approach?       constraints?

Federated Databases    no (?)      no        no              no
Data Warehousing       yes/no      yes       no              no
Mediator systems       no          (yes)     no              no
Global Inf. Systems    yes         yes       yes/no          no
Description Logics     no          no        yes             yes
Model management       yes/no      no        no (?)          no

Figure 3.8: Comparison of Data Integration Architectures.

ease of change is an issue. However, designing an appropriate global schema

for local-as-view integration is hard, and requires a good understanding of the

domain. Furthermore, the application of the local-as-view approach is only rea-

sonable if the overall goals and requirements of the global information system

do not change; otherwise, the global schema as well as all defined logical views

may quickly become invalid and require complete redesign. The interfaces that

GAV mediators export, on the other hand, often follow quite straightforwardly

and naturally from the sources that have to be combined.

LAV has sometimes been called a declarative method, and GAV procedural.

Indeed, the “schemata” that global-as-view mediators export are usually

more restrictive as to what kinds of queries can be asked than in LAV, where less

knowledge about how queries are answered is put into the views at design time

and more is decided at runtime. Indeed, LAV takes a more global perspective

when answering queries than GAV (the overall integration schema becomes a

mediated schema [Lev00]).

As pointed out earlier in Chapter 1, both the local-as-view and the global-as-

view approach make a very important assumption. It is supposed that the “global

schemata” or interfaces exported by mediators²⁹ can be designed at will for the

special purpose of integrating a number of sources. Either approach fails if this

assumption does not hold. For instance, consider the case of Example 3.6.1. We

cannot build a GAV mediator that answers any queries using the given sources

if we are required to export a “parents”(-only) interface. Conversely, imagine

source relations containing attributes that have no analog in the global logical

schema in the case of LAV.
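To make the contrast concrete, here is a minimal sketch, not from the thesis; the relation and rule names are hypothetical. Under GAV, a query over the global schema is answered by simply unfolding global predicates into their source definitions; under LAV, a source is instead described as a view over the global schema, and answering requires a rewriting algorithm such as MiniCon.

```python
# GAV: each global predicate is defined by a rule over source predicates.
# Rules map a global predicate to its defining body; atoms are (pred, args)
# pairs.  Variable renaming is elided for brevity.
gav_rules = {
    "parent": [("s1", ("P", "C"))],        # parent(P,C) :- s1(P,C)
    "lives_in": [("s2", ("P", "City"))],   # lives_in(P,City) :- s2(P,City)
}

def unfold_gav(query_body, rules):
    """Answer a query over global predicates by view unfolding:
    replace every global atom by the body of its defining rule;
    atoms with no rule (source atoms) are kept as-is."""
    out = []
    for pred, args in query_body:
        out.extend(rules.get(pred, [(pred, args)]))
    return out

# LAV: a source is described as a view over the global schema instead.
# Hypothetical source s3 stores grandparent pairs, expressible only as a
# join of 'parent' -- unfolding does not apply; rewriting is needed.
lav_view_s3 = [("parent", ("X", "Y")), ("parent", ("Y", "Z"))]

print(unfold_gav([("parent", ("P", "C")), ("lives_in", ("P", "T"))], gav_rules))
```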

Comparison of Architectures

Now consider Figure 3.8, in which we compare the data integration architectures

discussed in this chapter.

²⁹ These are in a sense “global” as well, because if they are not general enough, they will have
to be redesigned when further sources are added to a mediator.


• Federated databases support the autonomy of component databases. There

is thus no central “global” schema in the architecture. Traditionally, data

have been translated procedurally between schemata, although this is in

principle not a necessity.

• In the data warehousing architecture, sources are integrated against a single

sophisticated global warehouse schema. Integration is usually global-as-

view and procedural.

• Mediators à la [Wie92] apply query answering in a procedural manner.

Although mediators in systems such as TSIMMIS [GMPQ+97] are specified

declaratively, these specifications are compiled down into software compo-

nents that answer queries on the level of data. Global-as-view integration

by database views is based on query rewriting. However, mediators do not

take a global perspective with respect to the schema as known from local-

as-view and description logics integration. Although database views can be

considered as constraints under a declarative semantics, no global reasoning

under this semantics will lead to more complete results than just using the

views independently.

Mediators independently export interfaces according to which they can pro-

vide integrated data. Mediators making use of the services of other me-

diators are strongly coupled via their interfaces (see Figure 3.7). While

the mediator architecture at first sight does not rely on a global schema,

this coupling entails the usual disadvantages of global schemata, namely

that changes of requirements may lead to the need of a global, very work-

intensive redesign of many components (mediators) of the system.

• Global information systems may either use GAV or LAV integration. The

first case is not substantially different from the mediator approach just

discussed. Local-as-view integration has been discussed in sufficient detail

earlier.

• Description logics systems use a declarative approach with symmetric

constraints, allowing both mappings usually considered local-as-view and

those usually considered global-as-view to be encoded. The designer may

effectively define a global schema against which all sources are integrated,

but is free to do otherwise. Unfortunately, the approach not only rules

out query rewriting; worse, answering queries usually has a high data

complexity, compromising scalability.

• The model management approach at its core leaves open which integration

technology is to be used. While state-of-the-art research often uses very

restrictive mappings with a somewhat declarative flavor, one is free to make

70 CHAPTER 3. DATA INTEGRATION

other choices. Since integration schemata are just objects among many, no

global schema strategy can be observed.

We are now in a position to apply the lessons learned from previous work

to our problem of Section 1.2.

Chapter 4

**A Short Sightseeing Tour through our Reference Architecture**

4.1 Architecture

The data integration architecture of Figure 4.1 will serve as our reference for the

presentation of the contributions of this thesis. It contains a number of informa-

tion systems that retain design autonomy for their schemata, data models, and

query languages. Each information system may contain a number of databases

and processes which access and manipulate local data. For simplicity, but with-

out loss of generality, we assume the information systems to logically each contain

a single database over a single consistent schema. Other cases are handled by

either using distributed database techniques locally or splitting one information

system up into several systems that are considered independent for data inte-

gration purposes. Schemata may contain both true “source” entities for which

the local database holds data and logical entities over which local queries can be

executed as well, but for which it is the data integration system’s task to gather

and provide mediated data from other information systems.

Component information systems may be structurally heterogeneous. In order

to make integration possible, the overall information infrastructure of Figure 4.1

is assumed to have a “global” data model, query language, and format for com-

municating data (results to queries). Component information systems may each

differ in their choices of such structural factors.

A model management repository is part of the data integration architecture.

It stores “proxies”, copies of each schema in an information system in the infras-

tructure, as a first-class object subject to manipulation in the repository. These

proxy schemata are of course expressed in the global data model¹ used in the

¹ In this thesis, the relational data model will occupy this role.



[Figure 4.1 sketches the reference architecture: on the left, the model management
repository with an editor, proxy schemata (relational), mappings, query rewriting,
physical plan generation, and plan execution; on the right, the component
information systems, each with a mediator proxy and a local query facility;
schema and data translation crosses the boundary between the two.]

Figure 4.1: Reference Architecture

repository. Mappings (as sets of symmetric inter-schema constraints) are stored

in the repository and accessed by the data integration reasoning services (which

will be referred to as the mediator in the tradition of [JLVV00]). The reasoning

services are assumed to have been implemented only once, “globally”, for the

“global” data model and query language. Locally, inside the information sys-

tems, there are mediator “proxies”, which accept queries using the local query

language, relative to the schema over the local data model, but delegate their

answering, after translation to global data model and query language, to the me-

diator. Mediated queries can be issued either inside an information system using

the local data model and query language or directly against the global mediator.

The most common vehicles of structural integration used throughout the data

integration approaches of Chapter 3 are wrappers [GMPQ+ 97, RS97, GK94]. The

use of wrappers is appropriate for the structural integration of (legacy) informa-

tion systems that act as sources to some global information system only. The

metaphor of wrappers is insufficient in architectures with several heterogeneous

information systems that each may need access to integrated data. We propose

a different (and bi-directional) mechanism for structural integration, which may

be conceptualized in analogy with the cell membranes of living organisms. In our

context, heterogeneous information systems each are enclosed by some transla-

tion membrane, which transforms incoming queries and data from the global data

model and query language to the local one, and does the opposite for outgoing

queries, data, and schema information². If the structural design choices of some

component information system have been the same as those of the global data in-

tegration infrastructure, such a membrane is of course not needed. In the case of

component information systems that do not need to access integrated data from

other information systems, one may revert to the simpler wrapping approach.

² Information that may be on its way into the model management repository.


4.2 Mediating a Query

In general, queries are answered as follows. Initially, a query Q is issued against

one of the mediator proxies inside a component information system IS. This

query is then sent to an instance of the mediator. When crossing the boundary

of IS, Q is translated into a query Q′ in the “global” query language over the

proxy schema of IS, which is a citizen of the model management repository.

The mediator first rewrites Q′ into a query Q′′ over source predicates only,

using schema information and inter-schema constraints from the repository. This

query is then decomposed into an executable (distributed) query plan, which may

be optimized using cost-based metrics and special evaluation techniques known

from the distributed database field [MY95, OV99, Ull89]. To execute Q′′ , the

queries over component databases specified in the distributed query plan are sent

off to the individual information systems containing those databases.

While traversing the translation membrane surrounding component informa-

tion systems, the queries Qi are translated into queries Q′i over the local query

languages and modified to use the schemata over the local data models. These

queries are then passed on to the local query facilities, which execute them and

return data in formats relative to the local data models.

On the way back “out” of the component information systems and to the me-

diator, the data are translated to correspond to their schemata over the “global”

data model and are passed on to the mediator. There the data are combined

into one consistent result for Q′′ . This is then passed on to IS. On the way

through the component information system’s membrane, the result is reformu-

lated to adhere to Q and to the local data model of that component information

system.
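The round trip just described can be summarized as a pipeline. The following sketch is purely illustrative: every function name is a hypothetical placeholder (the thesis does not define such an API), and the stubs merely tag strings so the flow of a query through membranes, mediator, and local query facilities can be traced.

```python
# Hypothetical stubs: each stage just tags the value to make the flow visible.
def to_global(q, IS):      return f"global({q})"       # membrane, query inbound
def rewrite(q):            return f"rewritten({q})"    # mediator: Q' -> Q''
def decompose(q):          return [("IS1", f"sub1({q})"), ("IS2", f"sub2({q})")]
def to_local(q, IS):       return f"{IS}:{q}"          # membrane, query outbound
def execute(q, IS):        return f"data[{q}]"         # local query facility
def to_global_data(d, IS): return f"g({d})"            # membrane, data outbound
def combine(parts):        return " + ".join(parts)    # mediator merges results
def to_local_data(d, IS):  return f"local({d})"        # membrane, data inbound

def mediate(Q, IS):
    Q1 = to_global(Q, IS)            # Q -> Q' over the proxy schema
    Q2 = rewrite(Q1)                 # Q' -> Q'' over source predicates only
    results = []
    for IS_i, Qi in decompose(Q2):   # distributed (possibly optimized) plan
        Qi_local = to_local(Qi, IS_i)
        data = execute(Qi_local, IS_i)
        results.append(to_global_data(data, IS_i))
    return to_local_data(combine(results), IS)

print(mediate("Q", "IS0"))
```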

4.3 Research Issues

The following chapters will address the two main gaps left in our proposed

approach: query rewriting and the management of mappings under change.

Although much of what we have discussed in this chapter relates to structural

integration (covered here so that we can subsequently

focus on semantic integration), the problems related to it have been seen before

and are sufficiently well understood [GMPQ+ 97, RS97, GK94]. Similarly, dis-

tributed query execution is quite well understood once a logical query plan exists

[MY95, OV99, Ull89].

Data integration encompasses various aspects of data reconciliation that we

will, as a simplifying assumption, take to be implicit in the query rewriting

problem or simply exclude from consideration. For instance, object identification

[JLVV00] is the issue of matching objects from different databases which may be

identified by keys from distinct domains, or which may have no keys at all. This


problem has spurred some research of its own (e.g. [ZHKF95b]), but to a degree

such problems may be dealt with in our framework, as shown in the example in

Section 1.3. Another argument in favor of this stance also applies to a related data

reconciliation problem, data cleaning [JLVV00]. In fact, many of the intricacies

of these problems are related to matching erroneous data, inconsistencies that

often arise in the context of manually-acquired data. However, in our high-energy

physics use case of Section 1.3, for example, such data are rare. Data are usually

also well-identified by cleanly thought-through domains of identifiers.

The rationale behind the first main contribution, query rewriting with sym-

metric inter-schema constraints (Chapter 5), on the other hand, is the following.

Expressive constraints are required for two reasons.

• The need to deal with concept mismatch which results from schemata being

integrated against others that have not been conceived for data integration,

and which may be a consequence of schema evolution.

• The need for flexibility to anticipate future change of schemata

and requirements in the design of mappings. This includes the need for

expressiveness to prepare mappings for the merging of schemata,

and to emulate local-as-view integration even when sources cannot be de-

clared as views over the logical entities of the integration schemata.

The information infrastructure that has been outlined in this chapter can be

seen from a federated database perspective. There are several databases that

have design autonomy for their schemata (as well as for data models and query

languages), and each need to share data. As is well known for federated databases,

the lack of a “global” schema for data integration leads to the uncomfortable

situation that given N schemata, on the order of N² mappings between them need to be created

and managed. Given our requirement that schemata and integration requirements

may change, it is clear that the management task is difficult.

A surprising breakthrough on the management front is not to be expected.

Similar issues have been studied in various contexts by a large number of re-

searchers in the fields of software engineering, database (schema) design, and on-

tological engineering. The solutions that have been developed all center around

common ideas: the treatment of the artifacts to be managed as first-class cit-

izens on which clearly defined and powerful operations are developed that can

be used to manipulate them with the greatest possible amount of automation

and computer support, as well as the use of design patterns, best practices, and

design heuristics. We thus propose exactly such a solution, a model management

approach in combination with a methodology for managing mappings and their

change (Chapter 6).

Chapter 5

**Query Rewriting with Symmetric Constraints**

5.1 Outline

In this chapter, we address the query rewriting problem of data integration in a

very general setting. To start somewhere, we take the common approach of re-

stricting ourselves to the relational data model and conjunctive queries. We drop

the assumption of the existence of a single coherent global integration model over

which queries may be asked, which are then rewritten into queries in terms of

source predicates. Given a conjunctive (or positive) relational query over (possi-

bly) both virtual and source predicates, we attempt to find a maximally contained

rewriting in terms of only source predicates, under an appropriate semantics

and a set of constraints, with positive queries as the output query language (i.e., the

output is a set of conjunctive queries). We support symmetric constraints in the

form of what we call Conjunctive Inclusion Dependencies (cind’s), containment

relationships between conjunctive queries.

We propose two alternative justifiable semantics, the classical logical and a

straightforward rewrite systems semantics¹. Under both, the problem is a proper

generalization of the local-as-view as well as the global-as-view approaches.

In many real-life situations where neither source relations can be defined as

views over a given set of virtual relations nor a virtual relation as a view over a

number of sources, a satisfactory containment relationship between conjunctive

queries can be formulated using cind's. Apart from that, our type of constraints

makes it possible to map schemata in a model management context using a clean and

expressive semantics, or to “patch” local-as-view or global-as-view integration systems

¹ Informally speaking, the intuition of this second semantics is that given a conjunctive query

Q, a subexpression E of Q, and a cind Q1 ⊇ Q2 , if we can produce a contained rewriting under

the semantics of the problem of answering queries using views where we take E as query and

Q1 as logical view, we can replace (while applying the respective variable mappings) E in Q by

Q2 to produce a rewriting that is again “contained” in Q.



when sources need to be integrated whose particularities have not been foreseen

when designing the integration schemata. The problem may also be relevant for

maintaining physical data independence under schema evolution (see Section 7.1).

Unfortunately (as is immediately clear for the classical semantics), such pos-

itive rewritings may be infinite and the major decision problems (such as the

nonemptiness or boundedness of the result) are undecidable. However, given

that the predicate dependency graph (with respect to the inclusion direction) of

a set of constraints is acyclic, we can guarantee to find the maximally contained

rewritings under both semantics, which are finite. We will argue that for ob-

taining maximally contained rewritings in the data integration context, we can

require the constraints to be acyclic without much inconvenience; rather, it may

even be desirable.

As contributions of this chapter, we first provide characterizations of both

semantics as well as algorithms which, given a conjunctive query, enumerate the

maximally contained rewritings. We discuss various relevant aspects of query

rewriting in our context, such as the minimality and nonredundancy of conjunc-

tive queries in the rewritings. Next we compare the two semantics and argue that

the second is more intuitive and may better fit the expectations of human users

of data integration systems than the first. Following the philosophy of that se-

mantics, rewritings can be computed by making use of database techniques such

as query optimization and ideas from e.g. algorithms developed for the problem

of answering queries using views. We believe that in a practical information

integration context there are certain regularities (such as sets of predicates –

schemata – from which predicates are used together in queries, while there are

few queries that combine predicates from several schemata) that render query

rewriting following the intuitions of the second semantics more efficient in prac-

tice. Surprisingly, however, it can be shown that the two semantics coincide. We

then present a scalable algorithm for the rewrite systems semantics (based on

previous work such as [PL00]), which we have implemented in a practical system,

CindRew. We evaluate it experimentally against other algorithms for the same

and for the classical logical semantics. It turns out that our implementation,

which we make available for download, scales to thousands of constraints and

realistic applications. We conclude with a discussion of how our query rewriting

approach fits into state-of-the-art data integration systems.

5.2 Preliminaries

We define a conjunctive inclusion dependency (cind) as a constraint of the form

Q1 ⊆ Q2 where Q1 , Q2 are conjunctive queries (without arithmetic comparisons,

but possibly with constants) of the form

{⟨x1, ..., xn⟩ | ∃xn+1 ... xm : (p1(X̄1) ∧ ... ∧ pk(X̄k))}


with a set of distinct² unbound variables x1, ..., xn. We may write {Q1 ≡ Q2}

as a short form of {Q1 ⊆ Q2, Q1 ⊇ Q2}.

The normalization of a set Σ of cind’s is a set of Horn clauses, the set of

cind’s taken as a logical formula transformed into (implication) normal form.

These Horn clauses are of a simple pattern. Every cind σ of the form Q1 ⊆ Q2

with

Q1 = {⟨x1, ..., xn⟩ | ∃xn+1 ... xm : v1(X̄1) ∧ ... ∧ vk(X̄k)}

Q2 = {⟨y1, ..., yn⟩ | ∃yn+1 ... ym′ : p1(Ȳ1) ∧ ... ∧ pk′(Ȳk′)}

translates to k′ Horn clauses pi(Z̄i) ← v1(X̄1) ∧ ... ∧ vk(X̄k), where each zi,j

of Z̄i is determined as follows: If zi,j is a variable yh with 1 ≤ h ≤ n, replace it

with xh. If zi,j is a variable yh with n < h ≤ m′, replace it with the Skolem term

fσ,yh(x1, ..., xn) (the subscript assures that the Skolem functions are unique for

a given constraint and variable).

**Example 5.2.1** The normalization of the cind

σ : {⟨y1, y2⟩ | ∃y3 : p1(y1, y3) ∧ p2(y3, y2)}
  ⊇ {⟨x1, x2⟩ | ∃x3 : v1(x1, x2) ∧ v2(x1, x3)}

is

p1(x1, fσ,y3(x1, x2)) ← v1(x1, x2) ∧ v2(x1, x3).
p2(fσ,y3(x1, x2), x2) ← v1(x1, x2) ∧ v2(x1, x3).
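The normalization step can be sketched in a few lines. The representation below (predicate/variable-list pairs, Skolem terms rendered as strings) is an assumption made for illustration, not the thesis' own data structures; run on Example 5.2.1, it reproduces the two Horn clauses above.

```python
def normalize(name, sup_dist, sup_body, sub_dist, sub_body):
    """Normalize the cind  Q_sup >= Q_sub  into one Horn clause per atom of
    the subsuming query's body: distinguished variables of the subsumer are
    renamed to those of the subsumee, and existential variables become
    Skolem terms over the subsumee's distinguished variables."""
    rename = dict(zip(sup_dist, sub_dist))
    clauses = []
    for pred, args in sup_body:
        head_args = [rename.get(v, f"f_{name}_{v}({','.join(sub_dist)})")
                     for v in args]
        clauses.append(((pred, head_args), sub_body))
    return clauses

# Example 5.2.1:  sigma : {<y1,y2> | Ey3 : p1(y1,y3) ^ p2(y3,y2)}
#                        >= {<x1,x2> | Ex3 : v1(x1,x2) ^ v2(x1,x3)}
clauses = normalize("sigma",
                    ["y1", "y2"], [("p1", ["y1", "y3"]), ("p2", ["y3", "y2"])],
                    ["x1", "x2"], [("v1", ["x1", "x2"]), ("v2", ["x1", "x3"])])
for head, body in clauses:
    print(head, "<-", body)
```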

Whenever a cind translates into a function-free clause in normal form, we will

write it in datalog notation. This is the case for cind’s of the form

{⟨X̄⟩ | p(X̄)} ⊇ Q

i.e., the subsumer query is an ∃-free single-literal query.

The dependency graph of a set C of Horn clauses is the directed graph con-

structed by taking the predicates of C as nodes and adding, for each clause in C,

an edge from each of the body predicates to the head predicate. The diameter of

a directed acyclic graph is the longest directed path occurring in it. The depen-

dency graph of a set of cind’s is the dependency graph of its normalization. A set

of cind’s is cyclic if its dependency graph is cyclic. An acyclic set Σ of cind’s is

called layered if the predicates appearing in Σ can be partitioned into n disjoint

sets P1, ..., Pn s.t. for each cind σ : Q1 ⊆ Q2 ∈ Σ there is an index i such that

Preds(Body(Q1)) ⊆ Pi and Preds(Body(Q2)) ⊆ Pi+1, and Sources = P1.
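These graph notions are straightforward to operationalize. The following sketch (the clause format is an assumption for illustration, not the thesis' data structures) builds the dependency graph of a set of Horn clauses at the predicate level and tests it for cycles with a depth-first search.

```python
def dependency_graph(clauses):
    """clauses: list of (head_pred, [body_preds]); an edge runs from each
    body predicate to the head predicate of its clause."""
    edges = {}
    for head, body in clauses:
        for b in body:
            edges.setdefault(b, set()).add(head)
        edges.setdefault(head, set())
    return edges

def is_cyclic(edges):
    """Standard three-color DFS cycle detection."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in edges}
    def visit(v):
        color[v] = GRAY
        for w in edges[v]:
            if color[w] == GRAY or (color[w] == WHITE and visit(w)):
                return True
        color[v] = BLACK
        return False
    return any(color[v] == WHITE and visit(v) for v in edges)

# The normalization of Example 5.2.1 is acyclic: v1, v2 feed p1, p2.
acyclic = [("p1", ["v1", "v2"]), ("p2", ["v1", "v2"])]
print(is_cyclic(dependency_graph(acyclic)))               # False
# Adding a clause with v1 in the head closes a cycle p2 -> v1 -> p2.
print(is_cyclic(dependency_graph(acyclic + [("v1", ["p2"])])))  # True
```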

The problem that we want to address in this chapter is the following:

² Note that if we did not require unbound variables in constituent queries to be distinct,
the transformation into normal form would result in Horn clauses with equality atoms as heads.


**Definition 5.2.2** (Query rewriting under symmetric constraints.) Given disjoint

sets of so-called “source” (materialized) and “virtual” predicates, a conjunctive

(or positive) query Q over possibly both sources and virtual predicates, and a

set Σ of cind’s, find the maximally contained positive query Q′ exclusively over

source predicates under a given semantics.

Later in this chapter we will discuss two such semantics for this problem. The

maximally contained rewritings under these semantics will be defined analogously

to the case of the problem of answering queries using views. Note that we do not

require that the input query Q only contains virtual predicates; furthermore, we

do not by default have any special restrictions regarding a set of cind’s Σ, apart

from the following. Without loss of generality, and for simplicity, we assume that

no source predicates appear in any heads of Horn clauses created by normaliza-

tion of the cind’s. (We can always replace a source predicate that violates this

assumption by a new virtual predicate in all cind’s and then add a cind that

maps the source predicate to that new virtual predicate.)

5.3 Semantics

We discuss two alternative semantics for query rewriting, first the classical logical

and later a straightforward rewrite systems semantics.

**5.3.1 The Classical Semantics**

Let us begin with a straightforward remark on the containment problem for

conjunctive queries under a set of cind’s, which, since they are themselves con-

tainment relationships between conjunctive queries, is the implication problem

for this type of constraint. If we want to check a containment

{⟨X̄⟩ | ∃Ȳ : φ(X̄, Ȳ)} ⊇ {⟨X̄⟩ | ∃Z̄ : ψ(X̄, Z̄)}

of two conjunctive queries under a set Σ of cind's by refutation (without loss of

generality, we assume Ȳ and Z̄ to be disjoint and the unbound variables in the

two queries above to be the same³, X̄), we have to show

Σ, ¬(∀X̄ : (∃Ȳ : φ(X̄, Ȳ)) ← (∃Z̄ : ψ(X̄, Z̄))) ⊨ ⊥

i.e., the inconsistency of the constraints and the negation of the containment taken

together. In normal form, ψ becomes a set of ground facts where all variables

³ In the remainder of this chapter, we will implicitly – whenever we do not sacrifice clarity by
this – assume that variables from different clauses are distinct, or in different “name spaces”,
even if several instances of the same clause interfere with each other during unification or
unfolding, and that new variables are automatically introduced where necessary to assure this.


have been replaced one-to-one by new constants and φ becomes a clause with

an empty head, where all distinguished variables xi have been replaced by the

constants also used for ψ.

Example 5.3.1 For proving the containment

{⟨x1, x2⟩ | ∃x3 : (p1(x1, x3) ∧ p2(x3, x2))} ⊇
{⟨y1, y2⟩ | ∃y3 : (r1(y1, y3) ∧ r2(y3, y2))}

we have to translate it into

← p1(α1, x3) ∧ p2(x3, α2).
r1(α1, α3) ← .    r2(α3, α2) ← .

where α1, α2, α3 are constants not appearing elsewhere.
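The freezing step of the example can be sketched as follows: the subsumee's body becomes ground facts over fresh constants, and the constants chosen for the distinguished variables are the ones to substitute into the subsumer's goal clause. The constant-naming scheme (a1, a2, ...) is an illustrative assumption.

```python
import itertools

_fresh = itertools.count(1)

def freeze(dist_vars, body):
    """Freeze a conjunctive query: every variable is mapped one-to-one to a
    fresh constant, turning the body into ground facts.  Returns the facts
    and the constants assigned to the distinguished variables."""
    consts = {}
    def const(v):
        if v not in consts:
            consts[v] = f"a{next(_fresh)}"
        return consts[v]
    facts = [(p, tuple(const(v) for v in args)) for p, args in body]
    return facts, {v: consts[v] for v in dist_vars}

# Subsumee of Example 5.3.1: {<y1,y2> | Ey3 : r1(y1,y3) ^ r2(y3,y2)}
facts, frozen = freeze(["y1", "y2"],
                       [("r1", ["y1", "y3"]), ("r2", ["y3", "y2"])])
print(facts)    # the ground unit clauses r1(.,.), r2(.,.)
print(frozen)   # constants for the goal clause's distinguished variables
```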

We have now transformed our original problem into a set of equivalent Horn

clauses, and can treat it as a logic program. We can take the single clause with

the empty head above (the body of the subsumer query) and use it as a goal for

refutation.

**Definition 5.3.2** Under the classical semantics, a maximally contained rewriting

of a conjunctive query Q is equivalent to the set of all conjunctive queries Q′

over source predicates for which Σ ⊨ Q′ ⊆ Q.

We can obtain such a maximally contained rewriting in the following way.

Given a conjunctive query Q⁴ and the normalization C of a set of cind's, we add

a unit clause (with a tuple of distinct variables) for each source atom⁵. Then we

try to refute the body of Q. (Differently from what we do for containment, we do

not freeze any variables.) If we have found a refutation with a most general unifier

θ, we collect the unit clauses used and create a Horn clause with θ(Head(Q)) as

head and the application of θ to the instances of unit clauses involved in the proof

as body. If this clause is function-free, we output it; after that we go on as if we

had not found a “proof”, to compute more rewritings. Given e.g. a breadth-first

strategy, it is easy to see that this method will compute a maximally contained

rewriting of Q in terms of multisets of conjunctive queries in the sense that for

each conjunctive query contained in Q, a subsumer will eventually be produced.

See Example 5.3.10 for query rewriting by an altered refutation proof.

Equivalent rewritings can be computed by interleaving the computation of

contained rewritings with the verification if Q is contained in any of the already

computed rewritings.

⁴ The results in this chapter generalize to positive input queries in a straightforward manner.
⁵ We still assume that source predicates do not appear in any heads in C.


Unfortunately, since we allow for arbitrary conjunctive queries as subsumees

in cind’s, we cannot make any guarantees regarding the minimality or nonredun-

dancy of rewritings. While it is of course possible to minimize conjunctive queries

when they are produced, it is impractical to require that the result be nonredun-

dant. It can for instance easily be seen that we can encode arbitrary (recursive)

datalog programs as sets of cind’s. Query rewriting may then produce an infinite

result, and the boundedness problem (that is, telling whether the result will be

finite) is undecidable. Thus, if an incomplete result is acceptable in such cases,

it is more appropriate to output rewritings as soon as they are found, and not to

eliminate redundancies.

We next present an alternative algorithm for computing maximally contained

rewritings which proceeds in a bottom-up fashion. The intuition of this procedure

can be used to unfold constraints early on where appropriate, which may allow

us to avoid recomputing certain intermediate results many times. It also only

needs a restricted kind of unification that we want to look at in more detail.

**Algorithm 5.3.3** (Bottom-up query rewriting).

Input: The normalization C of a set of cind’s that do not contain source predi-

cates in the subsuming query, a conjunctive query Q, and a set of source predi-

cates S.

Output: A (multi-)set of conjunctive queries X exclusively over source predi-

cates.

X := {c ∈ C | Preds(Body(c)) ⊆ S};
C := C \ X;
forever {
    choose some clause c ∈ C ∪ {Q};
    let n = |Body(c)|; θ := ∅;
    choose some tuple ⟨c1, ..., cn⟩ with c1, ..., cn ∈ X ∪ {ε},
        (ci = ε) iff Pred(Bodyi(c)) ∈ S;
    for each 1 ≤ i ≤ n with ci ≠ ε do
        θ := unify(θ, Bodyi(c), Head(ci));
    if θ ≠ fail then {
        c′ := unfold(c, θ, ⟨c1, ..., cn⟩);
        if (c ≠ Q) ∧ (Body(c′) is function-free) then
            X := X ∪ {c′};
        else if (c = Q) ∧ (c′ is function-free) then
            print c′;
    }
    if no new query or clause for X can be found then
        exit;
}


We will now have a closer look at the functions “unify” and “unfold” which we

have used above and that we will meet again in that form later. “unify” takes a

most general unifier θ and two atoms a and b and produces a most general unifier

θ′ of a and b which is consistent with θ in the usual way, if one exists. Otherwise

the function returns fail (and we assume the variables in the two atoms to be

from two distinct name spaces). We assume that a is always from the body of

a clause “higher up” and that b is the head of a clause whose body is to replace

that former atom.

Unification here is simpler than in general because we have the following

restrictions: (1) Body(c) is always function-free, which simplifies the implemen-

tation of unification. (2) Since the body of each valid query must be function-free

and once a clause contains a function term in its body, it cannot recover from

that state, we can exclude the possibility that a function term from Head(ci) gets

unified with a variable from Head(cj ). For the same reason, we can disallow that

function terms get unified with variables that appear in atoms a ∈ Body(c) where

Pred(a) is a source. Secondly, when two function terms from Head(ci), Head(cj)

get unified with the same variable in c, they must be equal except for variable

renamings, because otherwise again subterms would get unified with variables

from some ck . (3) If a variable from c gets unified with a function term, it cannot

unify with any other variable. (4) If c is a query to be rewritten, we can block

all variables in Head(c) early on from being unified with function terms, as this

could again not lead to a function-free rewriting.

The function “unfold” accepts a Horn clause c with |Body(c)| = n, a unifier θ,

and a tuple of n Horn clauses or ε s.t. if ci ≠ ε, θ unifies Bodyi(c) with Head(ci).

It produces a new Horn clause from c by replacing each of its non-source body

atoms Bodyi(c), if ci ≠ ε, by θ(Body(ci)) (i.e., after applying substitutions from

the unifier). If ci = ε, Bodyi(c′) = θ(Bodyi(c)).
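A sketch of this restricted unification follows. The term representation is an assumption for illustration: variables are strings, Skolem terms are (symbol, args) tuples; constants and the occurs check are omitted. The example unifies a function-free query-side atom with the head of one of the clauses from Example 5.2.1.

```python
def walk(t, s):
    """Resolve a variable through the substitution chain."""
    while isinstance(t, str) and t in s:
        t = s[t]
    return t

def unify_terms(a, b, s):
    a, b = walk(a, s), walk(b, s)
    if a == b:
        return s
    if isinstance(a, str):                  # unbound variable on either side
        return {**s, a: b}
    if isinstance(b, str):
        return {**s, b: a}
    # two Skolem terms: same symbol and arity, then unify args pairwise
    if a[0] != b[0] or len(a[1]) != len(b[1]):
        return None
    for x, y in zip(a[1], b[1]):
        s = unify_terms(x, y, s)
        if s is None:
            return None
    return s

def unify_atoms(qatom, hatom, s):
    """Unify a (function-free) query body atom with a clause head."""
    if qatom[0] != hatom[0] or len(qatom[1]) != len(hatom[1]):
        return None
    for x, y in zip(qatom[1], hatom[1]):
        s = unify_terms(x, y, s)
        if s is None:
            return None
    return s

# Unify p1(u, w) with the head p1(x1, f_sigma_y3(x1, x2)) of Example 5.2.1:
theta = unify_atoms(("p1", ["u", "w"]),
                    ("p1", ["x1", ("f_sigma_y3", ("x1", "x2"))]), {})
print(theta)   # u bound to x1, w bound to the Skolem term
```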

If the clauses c1, ..., cn are from the normalization of a set of cind's rather

than from unfoldings of constraints (as produced by Algorithm 5.3.3), we may

avoid producing redundancies in the result by not including substituted bodies if

already another body from the same cind was included and this occurred under

the same substitution of all distinguished variables of that cind. A special case⁶

that is particularly easy to implement is when a variable of c has been unified

with a function term. In that case, only one body atom that contains this variable

needs to be substituted, all others can be dropped. This is the case because the

normalization of a cind will only produce function terms that contain all the

distinguished variables of the cind in a uniform manner. Therefore, when the

unification of a variable from c with two function terms with the same function

symbol succeeds, all the variables in a pair of function terms unified with the

6

This case is analogous to a technique that is part of the MiniCon algorithm [PL00], and

which allows to restrict oneself to including a view in a rewriting only once for each application

of a MiniCon description.

82 CHAPTER 5. QUERY REWRITING

same variable of c have been pairwise unified themselves.

5.3.2 The Rewrite Systems Semantics

The rewrite systems semantics is best defined using the notion of MiniCon descriptions (see Definition 3.6.3). We adapt this notion to our framework based on rewriting with Horn clauses.

Definition 5.3.4 (Inverse MiniCon Description). Let Q be a conjunctive query with n = |Body(Q)| and C be the normalization of a set of cind's. An (inverse) MiniCon description for Q is a tuple ⟨c1, . . . , cn⟩ ∈ (C ∪ {ε})ⁿ that satisfies the following two conditions. (1) For the most general unifier θ ≠ fail arrived at by unifying all the ci ≠ ε with Bodyi(Q), the unfolding of Q and ⟨c1, . . . , cn⟩ under θ is function-free, and (2) there is no tuple ⟨c′1, . . . , c′n⟩ ∈ {c1, ε} × · · · × {cn, ε} with fewer entries different from ε than in ⟨c1, . . . , cn⟩ such that the unfolding of Q with ⟨c′1, . . . , c′n⟩ is function-free.

Note that the inverse MiniCon descriptions of Definition 5.3.4 exactly coincide with the MCDs of Definition 3.6.3. The algorithm for computing maximally contained rewritings shown below can easily be reformulated so as to use the standard MCDs of [PL00]. That way, one can even escape the need to transform cind's into Horn clauses and can reason completely without the introduction of function terms. However, to support the presentation of our results (particularly the equivalence proof of the following section), we do not follow this path in this chapter.

Maximally contained rewritings of a conjunctive query Q are now computed by iteratively unfolding queries with single MiniCon descriptions⁷ until a rewriting contains only source predicates in its body.

Algorithm 5.3.5 (Query rewriting with MCDs).

Input: A conjunctive query Q, the normalization C of a set of cind's, and a set S of source predicates.
Output: A maximally contained rewriting of Q.

Qs := [Q];
while Qs is not empty do {
    [Q, Qs] := Qs;
    if Preds(Q) ⊆ S then output Q;
    else {
        M := compute inverse MCDs for Q, C;
        for each ⟨c1, . . . , cn⟩ ∈ M do {
            θ := ∅;
            for each 1 ≤ i ≤ n do
                θ := unify(θ, Bodyi(Q), ci);
            Q′ := unfold(Q, θ, ⟨c1, . . . , cn⟩);
            Qs := [Qs, Q′];
} } }

⁷ In this respect, the rewrite systems semantics differs from the MiniCon algorithm for the problem of answering queries using views.

Here, “unify” is the restricted kind of unification that we discussed in the previous section, with the additional constraint that now all function terms are of depth one (that is, no function term has a function term as a subterm).
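This restricted unification can be made concrete with a small sketch. The following is an illustrative Python encoding, not the CindRew code: variables are ("v", name), depth-one function terms are ("f", symbol, args), and names listed in `blocked` (e.g. query head variables) may never be bound to a function term.

```python
def walk(t, s):
    # follow variable bindings in the substitution s
    while t[0] == "v" and t in s:
        t = s[t]
    return t

def unify(t1, t2, s, blocked=frozenset()):
    """Unify two terms under substitution s; function terms have depth
    one, so the occurs check only inspects argument variables. Returns
    the extended substitution, or None on failure."""
    t1, t2 = walk(t1, s), walk(t2, s)
    if t1 == t2:
        return s
    if t1[0] == "f" and t2[0] == "v":
        t1, t2 = t2, t1
    if t1[0] == "v":
        if t2[0] == "f":
            args = [walk(a, s) for a in t2[2]]
            if t1[1] in blocked or t1 in args:  # blocked variable / occurs check
                return None
        return {**s, t1: t2}
    if t1[1] != t2[1] or len(t1[2]) != len(t2[2]):
        return None                             # clashing function symbols
    for a, b in zip(t1[2], t2[2]):
        s = unify(a, b, s, blocked)
        if s is None:
            return None
    return s
```

For instance, unifying x with f1(y) succeeds and binds x, but fails if x is blocked or occurs inside the function term.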

Definition 5.3.6 (Rewrite Systems Semantics). Let Q be a conjunctive query,

S a set of source predicates, and Σ a set of cind’s. Then, Algorithm 5.3.5 computes

the maximally contained positive rewriting of Q under Σ in terms of S under the

rewrite systems semantics.

Example 5.3.7 (“Coffee Can Problem” [DJ90]). Consider the rewrite system

black white → black
white black → black
black black → white

with symbols “white” and “black” and the input word

w = (white white black black white white black black),

where the goal is to repeatedly replace sequences of symbols of that word that match the left-hand side of one of the three productions listed above, so as to produce a rewriting that is as small as possible. One such sequence of replacements is

(0) white white black black white [white black] black
(1) white white [black black] white black black
(2) white white white [white black] black
(3) white white [white black] black
(4) white [white black] black
(5) [white black] black
(6) [black black]
(7) white

The pair of occurrences of the symbols “black” and “white” that is replaced in each step has been enclosed in brackets. Thus, the input string can be rewritten into a word with a single symbol, “white”.
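The example can be checked mechanically. The following sketch (illustrative Python, not part of the thesis) enumerates all words reachable from the input word under the three productions; since every production shortens the word, the search terminates.

```python
# The three productions of the "coffee can" rewrite system.
RULES = {
    ("black", "white"): ("black",),
    ("white", "black"): ("black",),
    ("black", "black"): ("white",),
}

def reachable(word):
    """All words derivable from `word` by repeatedly replacing an
    adjacent pair of symbols according to RULES."""
    seen, stack = set(), [tuple(word)]
    while stack:
        w = stack.pop()
        if w in seen:
            continue
        seen.add(w)
        for i in range(len(w) - 1):
            rhs = RULES.get(w[i:i + 2])
            if rhs is not None:
                stack.append(w[:i] + rhs + w[i + 2:])
    return seen

w0 = ("white", "white", "black", "black",
      "white", "white", "black", "black")
words = reachable(w0)
# The only reachable one-symbol word is ("white",): every production
# preserves the parity of the number of "black" symbols, which is even here.
print({w for w in words if len(w) == 1})
```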

We can simulate such behavior using query rewriting under the rewrite systems semantics. Let us search for one-symbol rewritings. We model an n-symbol word w ∈ {black, white}ⁿ as a query of the form

q(x1) ← start_end(x1, xn+1), p1(x1, x2), . . . , pi(xi, xi+1), . . . , pn(xn, xn+1).

where each pi is either “black” or “white” and x1, . . . , xn+1 are variables. The above input word is thus represented as

q(x1) ← start_end(x1, x9),
    white(x1, x2), white(x2, x3), black(x3, x4), black(x4, x5),
    white(x5, x6), white(x6, x7), black(x7, x8), black(x8, x9).
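This encoding of words as queries is mechanical. A sketch in illustrative Python, with atoms as plain tuples (the predicate name start_end is spelled with an underscore here):

```python
def word_to_query(word):
    """Build the body of q(x1) for an n-symbol word: a start_end atom
    spanning the whole word, then one atom per symbol linking x_i to
    x_{i+1}."""
    n = len(word)
    atoms = [("start_end", "x1", f"x{n + 1}")]
    atoms += [(word[i], f"x{i + 1}", f"x{i + 2}") for i in range(n)]
    return atoms

body = word_to_query(["white", "white", "black", "black",
                      "white", "white", "black", "black"])
print(body[0], body[1], body[-1])
```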

The rewrite system can be encoded as a set of cind's:

{⟨x, y⟩ | ∃z : black(x, z) ∧ white(z, y)} ⊇ {⟨x, y⟩ | black(x, y)}
{⟨x, y⟩ | ∃z : white(x, z) ∧ black(z, y)} ⊇ {⟨x, y⟩ | black(x, y)}   (⋆)
{⟨x, y⟩ | ∃z : black(x, z) ∧ black(z, y)} ⊇ {⟨x, y⟩ | white(x, y)}

Furthermore, we define two source predicates w_src and b_src and define cind's responsible for making the rewrite process terminate with “success” (i.e., a contained rewriting in terms of the source predicates is found):

{⟨x⟩ | ∃y : start_end(x, y) ∧ white(x, y)} ⊇ {⟨x⟩ | w_src(x)}
{⟨x⟩ | ∃y : start_end(x, y) ∧ black(x, y)} ⊇ {⟨x⟩ | b_src(x)}

It can be verified by applying the above algorithm (although this is a quite work-intensive task) that the maximally contained rewriting under the rewrite systems semantics is

q(x1) ← w_src(x1).

In fact, the seven-step sequence of replacements shown above can easily be used to create a proof in our rewrite systems semantics that this query is in the maximally contained rewriting. For the first replacement of that sequence, the tuple ⟨c1, . . . , cn⟩ ∈ (C ∪ {ε})ⁿ of Algorithm 5.3.5 would equal ⟨ε, ε, ε, cσ2,1, cσ2,2, ε, ε, ε⟩, where cσ2,1 and cσ2,2 are the first and second Horn clauses created by normalizing our second cind (⋆). We can conclude that the above rewrite system cannot result in a one-symbol rewriting “black” for the given input word.

5.3.3 Equivalence of the two Semantics

Theorem 5.3.8. Let Q be a conjunctive query, Σ be a set of cind's, and S be a set of “source” predicates. Then, the maximally contained rewriting under the classical logical semantics and Σ in terms of S and its analog under the rewrite systems semantics coincide.

To show this, we first establish the following auxiliary result.

Lemma 5.3.9. Let P be a resolution proof establishing a logically contained rewriting of a conjunctive query Q under a set of cind's Σ. Then, there is always a proof P′ establishing the same contained rewriting such that each intermediate rewriting is function-free.


Proof. Let us assume that each new subgoal a derived using resolution receives an identifying index idx(a). Then, given the proof P, there is for each subgoal a a unique next premise cidx(a), out of the Horn clauses in the normalization of Σ, to be applied. This is the Horn clause from our constraint base that will be unfolded with a to resolve it in P.

Note that the proof P is fully described by the indexes of the subgoals in the (body of the) original query Q, some unique indexing of the subgoals created later on in the proof (we do not need to know the atoms themselves), the clauses cidx(a), and the indexes that the subgoals in the bodies of these clauses are attributed with when they are unfolded with subgoals.

In our original proof P, each subgoal a of a goal is rewritten with cidx(a) in each step, transforming g0, the body of Q and the initial goal, via g1, . . . , gn−1 into gn, the body of the resulting rewriting. We maintain the head of Q separately across resolution steps and require that variables in the head are not unified with function terms, but apply all other unifications effected on the variables in the goals in parallel with the rewriting process. Already P must assure at every step that no variable from the head of Q is unified with a function term, as otherwise no conjunctive query could result.

We know that resolution remains correct no matter in which order the next due resolution steps cidx(a) are applied to the subgoals, and that we may even unfold, given e.g. a goal with two atoms, the first subgoal and then a subgoal from the unfolding of that first subgoal (and may do so any finite number of times) before we unfold our second original subgoal.

Coming back to deriving a function-free proof starting from P, all we now have to show is that at any intermediate step of a resolution proof with cind's, a nonempty set of subgoals S = {ai1, . . . , aik} ⊆ gi of the function-free intermediate goal gi exists such that, when only these subgoals are unfolded with their next due premises cidx(ai1), . . . , cidx(aik), the overall new goal gi+1 produced will be function-free⁸. The emphasis here lies on finding a nonempty such set S, as the empty set automatically satisfies this condition. If we can guarantee that such a nonempty set always exists until the function-free proof has been completed, our lemma is shown.

Let there be a dependency graph Ggi = ⟨V, E⟩ for each intermediate goal gi, with the subgoals as vertices and a directed edge ⟨a, b⟩ ∈ E iff a contains a variable v that is unified with a function term f(X̄) in Head(cidx(a)) and v appears in b and is unified with a variable (rather than a function term with the same function symbol) in Head(cidx(b)). (Intuitively, if there is an edge ⟨a, b⟩ ∈ E, then b must be resolved before a if a proof is to be obtained in which all intermediate goals are function-free.) As mentioned, query heads are guaranteed to remain function-free by the correctness of P. For instance, the dependency graph of the goal

← a(x)^(0), b(x, y)^(1), c(y, z)^(2), d(z, w)^(3).

with

c0 : a(x) ← a′(x).        c1 : b(f(x), x) ← b′(x).
c2 : c(x, x) ← c′(x).     c3 : d(g(x), x) ← d′(x).

would be G = ⟨{0, 1, 2, 3}, {⟨0, 1⟩, ⟨2, 3⟩}⟩.

⁸ The correctness of the proof P alone assures that the query head will be function-free as well.

We can now show that such a dependency graph G is always acyclic. In fact, if it were not, P could not be a valid proof, because unification would fail when trying to unify a variable in such a cycle with a function term that contains that variable. This is easy to see because, given our construction for obtaining Horn clauses from cind's, each function term contains all variables appearing in that same (head) atom. Consider for instance

q(x) ← a(x, y), a(y, z), b(w, z), b(z, y).

{⟨x, y⟩ | ∃z : a(x, z) ∧ a(z, y)} ⊇ {⟨x, y⟩ | b(x, y)}
{⟨x, y⟩ | ∃z : b(x, z) ∧ b(z, y)} ⊇ {⟨x, y⟩ | a(x, y)}

There is no rewriting under our two semantics, because the dependency graph of our above construction is cyclic already for our initial goal, the body of q.
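The acyclicity test and the resulting resolution order can be sketched as follows (illustrative Python, not thesis code; an edge (a, b) means b must be resolved before a, as in the dependency graph above):

```python
def resolution_order(vertices, edges):
    """Return an order in which subgoals can be resolved while keeping
    goals function-free (Kahn-style peeling), or None if the dependency
    graph is cyclic and no such order exists."""
    order, remaining = [], set(vertices)
    while remaining:
        # a vertex is free if nothing it depends on is still unresolved
        free = {v for v in remaining
                if not any(a == v and b in remaining for a, b in edges)}
        if not free:
            return None  # cycle: no function-free proof order
        order.extend(sorted(free))
        remaining -= free
    return order

# The acyclic example graph G = <{0,1,2,3}, {<0,1>, <2,3>}> admits an
# order, whereas a cyclic graph does not.
print(resolution_order({0, 1, 2, 3}, {(0, 1), (2, 3)}))
print(resolution_order({0, 1}, {(0, 1), (1, 0)}))
```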

However, since G is acyclic, we can unfold a nonempty set of atoms (those unreachable from other subgoals in graph G) in our intermediate goals until the proof has been completed.

Proof of Theorem 5.3.8. It is easy to see that the rewriting process for finding maximally contained rewritings under the rewrite systems semantics is equivalent to resolution in which only some of the subgoals of a goal may be rewritten in a single step and each intermediate rewriting has to be function-free.

Assume that a proof establishing a single contained conjunctive query is known for the rewrite systems semantics. Then, this is also a proof for the classical semantics, and inclusion in this direction is shown.

The other direction follows from Lemma 5.3.9. Given a resolution proof P that a conjunctive query Q′ is a contained rewriting of Q, we can always construct from P an analogous proof for the rewrite systems semantics.

From this equivalence of resolution proofs and proofs with function-free intermediate steps, we conclude that the overall search process for maximally contained rewritings under both semantics is guaranteed to lead to equal results.

Example 5.3.10. Consider a boolean conjunctive query q ← b(x, x, 0). and the following set of Horn clauses which, as is easy to see, constitute the normalization of a set of cind's (not shown here, in order to reduce redundancy):

b(x′, y′, s0) ← a(x, y, s2) ∧ eε(x, x′) ∧ e1(y, y′).     c0
b(x′, y′, s2) ← a(x, y, s0) ∧ e1(x, x′) ∧ e0(y, y′).     c4, c10, c11
b(x′, y′, s0) ← a(x, y, s1) ∧ e0(x, x′) ∧ eε(y, y′).     c12, c18, c19
b(x′, y′, s1) ← a(x, y, s0) ∧ e1(x, x′) ∧ e1(y, y′).     c20, c25
eε(x, x) ← v(x).                                         c2, c17
e1(x, f1(x)) ← v(x).                                     c3, c8, c23, c24
e0(x, f0(x)) ← v(x).                                     c9, c16
v(x) ← b(x, y, s).                                       c5, c13, c21
v(y) ← b(x, y, s).                                       c6, c14
a(x, y, s) ← b(x, y, s).                                 c1, c7, c15

where x, y, x′, y′ are variables and s0, s1, s2 are constants. Let P be the resolution proof

(0) ← b(x, x, 0)^(0).
(1) ← a(x, y, 2)^(1), eε(x, z)^(2), e1(y, z)^(3).
(2) ← b(f1(y), y, 2)^(4), v(f1(y))^(5), v(y)^(6).
(3) ← a(x1, y1, 0)^(7), e1(x1, f1(y))^(8), e0(y1, y)^(9),
     b(f1(y), v1, 2)^(10), b(v2, y, 2)^(11).              †10, †11
(4) ← b(f0(y1), y1, 0)^(12), v(f0(y1))^(13), v(y1)^(14).
(5) ← a(x2, y2, 1)^(15), e0(x2, f0(y1))^(16),
     eε(y2, y1)^(17), b(f0(y1), v1, 0)^(18),
     b(v2, y1, 0)^(19).                                   †18, †19
(6) ← b(y1, y1, 1)^(20), v(y1)^(21).
(7) ← a(x, x, 0)^(22), e1(x, f1(x))^(23),
     e1(x, f1(x))^(24), b(y1, v1, 1)^(25).                †25
(8) ← a(x, x, 0)^(22), v(x)^(26).

which rewrites our query into q ← a(x, x, 0), v(x). and in which we have superscribed each subgoal with its assigned index. To keep the presentation short, we have eliminated subgoals (marked with a dagger † and their index) that are redundant with a different branch of the proof. As claimed in our theorem, P can be transformed into the following proof in which each intermediate step is function-free.

(0) ← b(x, x, 0)^(0).
(1) ← a(x, y, 2)^(1), eε(x, z)^(2), [e1(y, z)^(3)].
(2) ← b(x, y, 2)^(4), v(x)^(5), [e1(y, x)^(3)].
(3) ← a(x1, y1, 0)^(7), e1(x1, x)^(8), e0(y1, y)^(9),
     b(x, v1, 2)^(10), [e1(y, x)^(3)].                    †10
(4) ← a(x1, y1, 0)^(7), e1(x1, x)^(8),
     [e0(y1, y)^(9)], ⟦e1(y, x)^(3)⟧.
(5) ← b(y, y1, 0)^(12), v(y)^(14), [e0(y1, y)^(9)].
(6) ← a(x2, y2, 1)^(15), e0(x2, y)^(16), eε(y2, y1)^(17),
     b(y, v1, 0)^(18), ⟦e0(y1, y)^(9)⟧.                   †18
(7) ← b(y1, y1, 1)^(20), v(y1)^(21).
(8) ← a(x3, y3, 0)^(22), e1(x3, y1)^(23),
     e1(y3, y1)^(24), b(y1, v1, 1)^(25).                  †25
(9) ← a(x3, x3, 0)^(22), v(x3)^(26).

The subgoals marked with brackets [ ] had been blocked at a certain step to keep the proof function-free; a subgoal shown as ⟦ ⟧ is unblocked and resolved in that very step.

Of course, this correspondence between function-free and general resolution proofs does not hold for Horn clauses in general.

Example 5.3.11. Consider the boolean query

q ← a1(u, v), b1(u, v).

and the Horn clauses

a1(f(x), y) ← a2(x, y).     a2(x, g(y)) ← a3(x, y).
b1(x, g(y)) ← b2(x, y).     b2(f(x), y) ← b3(x, y).

These entail

q ← a3(x, y), b3(x, y).

although one cannot arrive at a function-free intermediate rewriting by unfolding first the left subgoal of our query (which would result in q ← a2(x, y), b1(f(x), y).), or first the right subgoal (which would result in q ← a1(x, g(y)), b2(x, y).), or both at once (resulting in q ← a2(x, g(y)), b2(f(x), y).).

5.3.4 Computability

Theorem 5.3.12. Let Σ be a set of cind's and Q and Q′ be conjunctive queries. Then the following problems are undecidable:

• Σ ⊨ Q ⊆ Q′, the containment problem;

• ∃Q′ : Σ ⊨ Q ⊇ Q′, i.e., it is undecidable whether the maximally contained rewriting of a conjunctive query Q under the classical logical semantics is nonempty (that is, whether it contains at least one conjunctive query)⁹.

⁹ By Theorem 5.3.8, this is equivalent to the following problem: Given a conjunctive query Q, is the maximally contained rewriting under the rewrite systems semantics nonempty?


We now give an intuition for the undecidability results of Theorem 5.3.12. Post's Correspondence Problem (PCP, see e.g. [HU79]), a simple and well-known undecidable problem, is defined as follows. Given nonempty words x1, . . . , xn and y1, . . . , yn over the alphabet {0, 1}, the problem is to decide whether there are indexes i1, . . . , ik (with k > 0) such that xi1 xi2 · · · xik = yi1 yi2 · · · yik. Below, we show by example an encoding of PCP in terms of our query rewriting problem.

In fact, Example 5.3.10 already presented an encoding of PCP that shows the undecidability of query rewriting with cind's¹⁰. In the following example, we provide another encoding which is simpler.

Example 5.3.13. Given are a source s, a boolean query q ← inc(0, 0). and the following five cind's

{⟨x, y⟩ | dec(x, y)} ⊆ {⟨x, y⟩ | ∃x1, y1 : zero(x, x1) ∧ zero(y, y1) ∧ inc(x1, y1)}   (1)
{⟨x, y⟩ | dec(x, y)} ⊆ {⟨x, y⟩ | ∃x1, y1 : zero(x, x1) ∧ zero(y, y1) ∧ dec(x1, y1)}   (2)
{⟨x, y⟩ | dec(x, y)} ⊆ {⟨x, y⟩ | ∃x1, y1 : one(x, x1) ∧ one(y, y1) ∧ inc(x1, y1)}    (3)
{⟨x, y⟩ | dec(x, y)} ⊆ {⟨x, y⟩ | ∃x1, y1 : one(x, x1) ∧ one(y, y1) ∧ dec(x1, y1)}    (4)
dec(0, 0) ← s.                                                                       (5)

that constitute the core encoding, and two constraints

inc(x, y) ← one(x, x1), zero(x1, x2), one(x2, x3), one(y, y1), inc(x3, y1).          (6)

inc(x, y) ← one(x, x1), zero(y, y1), one(y1, y2), one(y2, y3), one(y3, y4),
            zero(y4, y5), inc(x1, y5).                                               (7)

that stand for a PCP problem instance with two pairs of words,

I = {⟨x1 = 101, y1 = 1⟩, ⟨x2 = 1, y2 = 01110⟩}.

The constraints (1) – (4) can be considered to play the role of “guessing” a solution to the PCP problem, constraints (6) and (7) the role of “checking” the solution, and constraint (5) the role of “terminating” when the search was successful.

¹⁰ Example 5.3.10 is an encoding of PCP with the instance I = {⟨x1 = 10, y1 = 1⟩, ⟨x2 = 1, y2 = 01⟩}. The instance itself is encoded in the first four Horn clauses only. The encoding, while more complicated than the one presented in this section, allows us to show the undecidability of query rewriting (a PCP instance is satisfiable if and only if the maximally contained rewriting of the query q ← b(x, x, 0). is nonempty) as well as the undecidability of query containment under a set of cind's (a PCP instance is satisfiable iff {⟨⟩ | ∃x : v(x) ∧ a(x, x, 0)} ⊆ {⟨⟩ | ∃x : b(x, x, 0)}).


To show the PCP instance satisfiable, one can compute a contained rewriting by applying the constraints in the following order (we only describe the successful proof, not the dead-end branches): (guess phase) (6), (7), (6), (check phase) (3), (2), (4), (4), (4), (2), (4), (termination) (5). The maximally contained rewriting is nonempty because there is a solution to this particular PCP instance: x1 x2 x1 = y1 y2 y1 = 1011101.
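The claimed solution can be confirmed by a brute-force search (illustrative Python; PCP is undecidable in general, so the sequence length must be bounded):

```python
from itertools import product

def pcp_solution(pairs, max_len=6):
    """Search for indexes i1..ik (k <= max_len) with
    x_{i1}...x_{ik} == y_{i1}...y_{ik}; returns the first one found,
    or None if no bounded solution exists."""
    for k in range(1, max_len + 1):
        for seq in product(range(len(pairs)), repeat=k):
            x = "".join(pairs[i][0] for i in seq)
            y = "".join(pairs[i][1] for i in seq)
            if x == y:
                return list(seq)
    return None

# The instance of Example 5.3.13 (0-based indexes: [0, 1, 0] is x1 x2 x1).
I = [("101", "1"), ("1", "01110")]
print(pcp_solution(I))  # -> [0, 1, 0], i.e. x1 x2 x1 = y1 y2 y1 = 1011101
```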

5.3.5 Complexity of the Acyclic Case

For the important case that Σ is acyclic, the two problems above are decidable (and NEXPTIME-complete). We first establish the following auxiliary result.

Lemma 5.3.14. Let Σ be an acyclic set of cind's and Q and Q′ be conjunctive queries. Then the containment problem Σ ⊨ Q ⊆ Q′ and the problem of deciding whether the maximally contained rewriting of Q (as a set of conjunctive queries) is nonempty are NEXPTIME-hard.

Proof. NEXPTIME-hardness follows from a slightly altered form of the encoding of the NEXPTIME-complete TILING problem (see e.g. [Pap94]) used in [DV97] to show NEXPTIME-hardness of the SUCCESS problem for nonrecursive logic programming.

TILING is the problem of tiling the square of size 2ⁿ × 2ⁿ by tiles – squares of size 1 × 1 – of k types. There are two binary relations on and to defined on the tiles. Tiles ti and tj are said to be horizontally compatible if ⟨ti, tj⟩ ∈ to holds and are called vertically compatible if ⟨ti, tj⟩ ∈ on. A tiling of the square of size 2ⁿ × 2ⁿ is a function f : {1, . . . , 2ⁿ} × {1, . . . , 2ⁿ} → {t1, . . . , tk} such that vertically and horizontally neighboring tiles are compatible, i.e.,

⟨f(i, j), f(i + 1, j)⟩ ∈ to   for all 1 ≤ i < 2ⁿ, 1 ≤ j ≤ 2ⁿ

and

⟨f(i, j), f(i, j + 1)⟩ ∈ on   for all 1 ≤ i ≤ 2ⁿ, 1 ≤ j < 2ⁿ.

The TILING problem is then the following: given a set {t1, . . . , tk} of tiles, compatibility relations on and to, and a number n written in unary notation, decide whether there exists a tiling f of the square of size 2ⁿ × 2ⁿ with a distinguished tile type, say t1, at the top left corner (i.e., f(1, 1) = t1).
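For intuition only, tiny instances of TILING can be decided by brute force. The following illustrative Python sketch uses 0-based coordinates instead of the 1-based ones of the definition, and is feasible only for trivially small squares:

```python
from itertools import product

def has_tiling(tiles, on, to, side, corner):
    """Check whether a side x side square has a compatible tiling with
    `corner` at position (0, 0)."""
    cells = [(i, j) for i in range(side) for j in range(side)]
    for assignment in product(tiles, repeat=len(cells)):
        f = dict(zip(cells, assignment))
        if f[(0, 0)] != corner:
            continue
        horiz = all((f[(i, j)], f[(i + 1, j)]) in to
                    for i in range(side - 1) for j in range(side))
        vert = all((f[(i, j)], f[(i, j + 1)]) in on
                   for i in range(side) for j in range(side - 1))
        if horiz and vert:
            return True
    return False

# Two tile types that must strictly alternate admit a 2 x 2 tiling ...
alt = {("A", "B"), ("B", "A")}
print(has_tiling({"A", "B"}, alt, alt, 2, "A"))           # -> True
# ... but if "B" may never be followed by anything, no 2 x 2 tiling exists.
one_way = {("A", "B")}
print(has_tiling({"A", "B"}, one_way, one_way, 2, "A"))   # -> False
```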

We describe a reduction that transforms any instance of the TILING problem into an instance of the containment problem of conjunctive queries under an acyclic set of cind's and which requires only polynomial time relative to the size of the problem instance.

[Figure 5.1: Hypertile of size i ≥ 2 (left) and the nine possible overlapping hypertiles of size i − 1 (right).]

We define hypertiles as follows. Each composition of 2 × 2 tiles or hypertiles is a hypertile if the component tiles satisfy the compatibility constraints. Obviously, all hypertiles are of size 2ⁱ × 2ⁱ for some i ≥ 1 [DV97]. In our encoding, we define hypertiles of level 1 by the following cind:

{⟨x1, x2, x3, x4⟩ | ∃xf : til1(xf, x1, x2, x3, x4, x1)} ⊇
{⟨x1, x2, x3, x4⟩ | to(x1, x2) ∧ to(x3, x4) ∧ on(x1, x3) ∧ on(x2, x4)}

Fortunately, for hypertiles of level i ≥ 2, it is not necessary to enforce that all the compatibility constraints are satisfied on the level of tiles. Instead, it is sufficient to verify that all of the nine possible (overlapping) constituent hypertiles of the next-smaller level i − 1 (see Figure 5.1) satisfy the compatibility constraints. We define hypertiles of level greater than one by

{⟨xf, yf, zf, uf, t⟩ | ∃f : tili+1(f, xf, yf, zf, uf, t)} ⊇
{⟨xf, yf, zf, uf, t⟩ | ∃x1, . . . , x4, y1, . . . , y4, z1, . . . , z4, u1, . . . , u4, d1, . . . , d13 :
    tili(xf, x1, x2, x3, x4, t) ∧
    tili(yf, y1, y2, y3, y4, d1) ∧
    tili(zf, z1, z2, z3, z4, d2) ∧
    tili(uf, u1, u2, u3, u4, d3) ∧
    tili(d4, x2, y1, x4, y3, d5) ∧
    tili(d6, x4, y3, z2, u1, d7) ∧
    tili(d8, z2, u1, z4, u3, d9) ∧
    tili(d10, x3, x4, z1, z2, d11) ∧
    tili(d12, y3, y4, u1, u2, d13)}

Let bot be a nullary predicate. To complete our encoding, we add cind's

on(ti, tj) ← bot.   for each ⟨ti, tj⟩ ∈ on, and
to(ti, tj) ← bot.   for each ⟨ti, tj⟩ ∈ to,

where ti and tj are constants identifying pairs out of the k given tile types.

Let us consider the encoding shown above as a logic program (which we obtain by normalizing the cind's). The existential variables in the subsumer queries of the tili cind's are transformed into function terms aggregating the four hypertiles of the next smaller size. (In fact, the variables for the top left corner tiles t are also aggregated in the function terms, but this does not alter the correctness of the encoding.) The cind for til1 is transformed into the Horn clause

til1(f1(x1, x2, x3, x4), x1, x2, x3, x4, x1) ←
    to(x1, x2), to(x3, x4), on(x1, x3), on(x2, x4).

and the cind's for tili, i ≥ 2, are normalized as Horn clauses with heads

tili(fi(x1, x2, x3, x4, t), x1, x2, x3, x4, t)

During bottom-up evaluation of such a logic program, the function terms constructed using fi correspond exactly to the valid hypertiles constructible from the given k tile types, if the fifth arguments of function terms with symbols fi, i ≥ 2, are ignored.
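The normalization step that introduces these function terms can be sketched as follows. This is illustrative Python, not Algorithm 5.3.3 itself: each existential variable of the subsumer side is replaced by one function term over all distinguished variables, so equal existentials map to the same term.

```python
from itertools import count

_fresh = count(1)  # supplies fresh function symbols f1, f2, ...

def skolemize(subsumer_atoms, dist_vars, exist_vars):
    """Replace every existential variable by a function term
    (symbol, args) built over all the distinguished variables."""
    sub = {v: (f"f{next(_fresh)}", tuple(dist_vars)) for v in exist_vars}
    return [tuple(sub.get(t, t) for t in atom) for atom in subsumer_atoms]

# Subsumer side of {<x,y> | exists z: a(x,z) /\ a(z,y)}: z becomes a
# function term over the distinguished variables x and y in both atoms.
atoms = skolemize([("a", "x", "z"), ("a", "z", "y")], ["x", "y"], ["z"])
print(atoms)
```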

It is quite easy to see that there is a solution for the TILING problem iff the constraints in our encoding entail

{⟨⟩ | bot} ⊆ {⟨⟩ | ∃f, x, y, z, u : tilm(f, x, y, z, u, t1)}

Equally, there is a solution to the TILING problem exactly if the maximally contained rewriting of {⟨⟩ | ∃f, x, y, z, u : tilm(f, x, y, z, u, t1)} in terms of the “source predicate” bot is nonempty. Thus, these two problems are NEXPTIME-hard.

Theorem 5.3.15. Let Σ be an acyclic set of cind's and Q and Q′ be conjunctive queries. Then the containment problem Σ ⊨ Q ⊆ Q′ and the query rewriting problem for conjunctive queries (under acyclic sets of cind's) are NEXPTIME-complete.

Proof. As pointed out in Section 5.3.1, the query containment problem under an acyclic set of cind's can be solved by proving the unsatisfiability of the negation of the containment, which decomposes into a set of ground facts (in analogy with the canonical database of the “freezing trick” of Example 2.2.3) and a goal. This is a special case of the SUCCESS problem for nonrecursive logic programs [DV97, VV98].

The problem of deciding whether query rewriting produces a nonempty set of conjunctive queries can be reduced to the SUCCESS problem by introducing unit clauses si(x1, . . . , xni) ←. (where x1, . . . , xni are distinct variables) for each “source” predicate si of arity ni.

As both problems are known to be NEXPTIME-hard from Lemma 5.3.14, completeness for NEXPTIME has been shown.

This result shows that by restricting ourselves to acyclic sets of cind's we have nevertheless retained all the expressive power for decision-making (modulo polynomial transformations) of nonrecursive logic programming.


5.4 Implementation

Our implementation is based on Algorithm 5.3.5, but makes use of several optimizations. Every time an MCD m is unfolded with a query to produce an intermediate rewriting Q, we compute a query Q′ as follows:

Body(Q′) := {Bodyi(Q) | mi ≠ ε}
Head(Q′) := ⟨X̄⟩, where X̄ consists of the variables xi ∈ Vars(Head(Q)) ∩ Vars(Body(Q′))

Q′ is thus created from the new subgoals of the query that have been introduced using the MCD. If Q′ contains non-source predicates, the following check is performed: we check whether our rewriting algorithm produces a nonempty rewriting on Q′. This is carried out in depth-first fashion. If the set of cind's is cyclic, we use a maximum lookahead distance to ensure that the search is finite. If Q′ is not further rewritable, Q is not processed any further but is dropped.

Subsequently, (intermediate) rewritings produced by unfolding queries with MiniCon descriptions are simplified using tableau minimization.

Directly after parsing, Horn clauses whose head predicates are unreachable from the predicates of the query are filtered out. The same is done with clauses not in the set X computed by the fixpoint iteration

X := ∅;
do
    X := X ∪ {c ∈ C | Preds(Body(c)) ⊆ Sources ∪ {Pred(Head(c′)) | c′ ∈ X}};
while X changed;
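The fixpoint can be sketched as follows (illustrative Python; a clause is represented only by its head predicate and the list of its body predicates):

```python
def useful_clauses(clauses, sources):
    """Least fixpoint of the iteration above: keep exactly the clauses
    whose body predicates are sources or heads of already-kept clauses.
    `clauses` maps a clause name to (head_pred, body_preds)."""
    kept = set()
    changed = True
    while changed:
        derivable = sources | {clauses[c][0] for c in kept}
        new = {c for c, (_, body) in clauses.items()
               if set(body) <= derivable}
        changed = new != kept
        kept = new
    return kept

C = {"c1": ("a", ["s"]),       # a <- s        (kept: s is a source)
     "c2": ("b", ["a", "s"]),  # b <- a, s     (kept: a becomes derivable)
     "c3": ("d", ["e"])}       # d <- e        (dropped: e is underivable)
print(sorted(useful_clauses(C, {"s"})))  # -> ['c1', 'c2']
```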

We have implemented the simple optimizations known from the Bucket Algorithm [LRO96] and the Inverse Rules Algorithm [GKD97] for answering queries using views, which are used to reduce the branching factor in the search process. Beyond that, MiniCon descriptions are computed with an intelligent backtracking method that always chooses to cover first those subgoals for which this can be done deterministically (i.e., for which the number of Horn clauses that are candidates for unfolding with a particular subgoal can be reduced to one), thereby reducing the amount of branching.

Our unification algorithm allows one to pre-specify variables that may in no case be unified with a function term (e.g., head variables of queries or variables in atoms already over source predicates). This allows us to detect as early as possible that a function-free rewriting cannot be created.

In the implementation of the deterministic component of our algorithm for generating MiniCon descriptions, we first check whether the corresponding pairs of terms of two atoms to be matched unify independently before doing full unification. This allows us to detect most violations with very low overhead. Given an appropriate implementation, it is possible to check this property in logarithmic or even constant time.


An important performance issue in Algorithm 5.3.5 is the fact that MCDs are only applied one at a time, which leads to redundant rewritings (e.g., the same MCDs may be applicable in different orders, as is true for the classical problem of answering queries using views, a special case) and thus to a search space that may be larger than necessary. We use dependency graph-based optimizations to check whether a denser packing of MCDs is possible. For the experiments with layered sets of cind's reported on in Section 5.5 (Figures 5.3 and 5.4), MCDs are packed exactly as densely¹¹ as in the MiniCon algorithm of [PL00].

Distribution

The implementation of our query rewriter (with algorithms for both semantics presented) consists of about 9000 lines of C++ code. Binaries for several platforms, as well as examples and a Web demonstrator that allows one to run limited-size problems online, are available on the Web at

http://cern.ch/chkoch/cindrew/

5.5 Experiments

A number of experiments have been carried out to evaluate the scalability of our implementation. They were executed on a 600 MHz dual Pentium III machine running Linux. A benchmark generator was implemented that randomly generated example queries and sets of cind's; this program created chain as well as random queries (and cind's).

In all experiments, the queries had 10 subgoals, and we averaged timings over 50 runs. Sets of cind's were always acyclic. This was ascertained by the use of predicate indexes such that the predicates in the subsumer query of a cind only used indexes greater than or equal to a random number determined for each cind, and the subsumed queries only used indexes smaller than that number. Times for parsing the input were excluded from the diagrams, and redundant rewritings were not eliminated¹². The diagrams relate reasoning times on the (logarithmic-scale) vertical axis to the problem size as a number of cind's on the horizontal axis.

5.5.1 Chain Queries

Chain queries are conjunctive queries of the form

q(x1, xn+1) ← p1(x1, x2), p2(x2, x3), . . . , pn−1(xn−1, xn), pn(xn, xn+1).

¹¹ See Section 3.6.2.
¹² Note that CindRew can optionally make rewritings nonredundant and minimal. However, for these experiments, these options were not active.
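A benchmark-style chain query can be generated in a few lines. This is an illustrative Python sketch, not the actual benchmark generator; atoms are (predicate, variable, variable) tuples:

```python
import random

def chain_query(n, preds, rng=random):
    """Generate the body p1(x1, x2), ..., pn(xn, x_{n+1}) of a chain
    query, drawing predicate names from `preds`."""
    return [(rng.choice(preds), f"x{i}", f"x{i + 1}")
            for i in range(1, n + 1)]

body = chain_query(10, ["p1", "p2", "p3"])
# consecutive atoms share a variable; the head variables would be x1 and x11
print(body[0][1], body[-1][2], len(body))
```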


[Figure 5.2: Experiments with chain queries and nonlayered chain cind's. Reasoning time in seconds (log scale) vs. number of cind's, for p = 8, 12, and 16 predicates, plus the classical-semantics algorithm at p = 16; chain queries, unlayered, 3–6 subgoals per cind.]

**Thus, chain queries are constructed by connecting binary predicates via vari-
**

ables, as shown above. In our experiments, the distinguished (head) variables

were the first and the last. The chain cind’s had between 3 and 6 subgoals in

both the subsuming and the subsumed queries.

We report on three experiments with chain queries.

The first diagram (Figure 5.2) shows timings for chain queries. The steep line on the left reports on an alternative query rewriting algorithm that we have implemented, which follows the classical semantics and a traditional resolution strategy, unfolding certain clauses where this is deemed appropriate, as described in Algorithm 5.3.3. This strategy is particularly effective with acyclic sets of constraints that are densely packed, as is the case here. The experiment reported on here was carried out with 16 predicates. This algorithm is compared to, and clearly outperformed by, CindRew (with three different numbers of predicates: 8, 12, and 16). Since the constraints become sparser as more predicates are available, a larger number of predicates renders the query rewriting process simpler.

In the second diagram (Figure 5.3), we report on CindRew's execution times with cind's that were generated with an implicit layering of predicates (with 2 layers). This experiment is in principle very similar to local-as-view rewriting with p/2 global predicates and p/2 source predicates (where the subsumer queries of cind's correspond to logical views in the problem of answering queries using views), followed by simple view unfolding to account for the subsumed queries of cind's. We again report timings for three different numbers of predicates.

In the third diagram, the same problem of finding a maximally contained

[Figure 5.3 plot: reasoning time in seconds (log scale) vs. number of cind's, 0–3000; curves for p = 8, 12, 16; chain queries, 3–6 predicates per query, 2 layers of predicates.]

Figure 5.3: Experiments with chain queries and two layers of chain cind’s.

[Figure 5.4 plot: reasoning time in seconds (log scale, 10⁻⁵ to 10¹) vs. number of cind's, 0–6000; curves for p = 20 and p = 40; chain queries, 3–6 predicates per query, 5 layers of predicates.]

Figure 5.4: Experiments with chain queries and five layers of chain cind's.

[Figure 5.5 plot: reasoning time in seconds (log scale) vs. number of cind's, 0–2500; random queries, predicate arity 2–3, subsumers with 3–4 subgoals, subsumees with 2 subgoals, 1–2 distinguished variables, 5 layers of predicates.]

Figure 5.5: Experiment with random queries.

rewriting is solved for 20 and 40 predicates, which are grouped into a stack of five layers of 4 and 8 predicates each, respectively. Of the five sets of predicates, one constitutes the sources and one the "schema" over which queries are asked, and four equally sized sets of cind's bridge between these layers [13].

As can be seen by comparing the second and third diagrams with the first, the hardness of the layered problems is more homogeneous. Particularly in Figure 5.2 and Figure 5.3, one can also observe subexponential performance. Note that in the experiment of Figure 5.4, timings were taken in steps of 20 cind's, while in the other experiments, this step length was 100.

5.5.2 Random Queries

The random queries had either three or four subgoals in the subsumer and two subgoals in the subsumed query. Predicates had arity two or three, and the number of distinguished variables was either one or two. The number of existential variables was two to three times as high in order to reduce the number of correct solutions. For the experiments carried out with random queries, the number of solution rewritings quickly got very large, so we report on computing at most 100 rewritings [14]. Figure 5.5 shows the timings for random queries as described

[13] See Section 5.2 for our definition of layered sets of cind's.

[14] In the runs with chain queries (and constraints), we of course computed all rewritings.


earlier, with five predicates and five layers (i.e., one predicate per layer). We report only on the case with five layers because with fewer layers or no layers at all, computing the first 100 solutions was too easy.

5.6 Discussion

This chapter has addressed the query rewriting problem in data integration from a fresh perspective. Expressive symmetric constraints are used, which we have called Conjunctive Inclusion Dependencies. The problem of computing the maximally contained rewritings was studied under two justifiable semantics. We have discussed their main theoretical properties and have shown that they coincide. We have presented the second semantics, motivated by rewrite systems, as a valuable alternative to the classical logical one. This semantics allows us to apply time-tested (e.g., tableau minimization) as well as more recent (e.g., the MiniCon algorithm) techniques and algorithms from the database field to the query rewriting problem.

There are several advantages of algorithms following the philosophy of the rewrite systems semantics for query rewriting. Under this semantics, intermediate results are (function-free) queries and can be immediately made subject to query optimization techniques known in the database field. As a consequence, further query rewriting may start from simpler queries, leading to an increase in performance and fewer redundant results that later have to be found and eliminated. Thus, it is often possible to detect dead ends early. As a trade-off (as can be seen in Algorithm 5.3.5), an additional degree of nondeterminism is introduced compared to resolution-based algorithms under the classical semantics.

In the context of data integration, there are usually a number of regularities in the way constraints are implemented and queries are posed. Usually we expect to have a number of schemata, each containing a number of predicates. Between the predicates of one schema, no constraints for data integration purposes are defined. Moreover, we expect inter-schema constraints usually to be of the form Q1 ⊆ Q2, where most (or all) predicates in Q1 belong to one and the same schema, while the predicates of Q2 belong to another. Queries issued against the system are usually formulated in terms of a single schema. Given these assumptions, we suspect that algorithms following the rewrite systems semantics, which apply optimization techniques from the database area to intermediate results, have a performance advantage over classical resolution-based algorithms, which do not exploit such layering heuristics.

Clearly, the noncomputability of rewritings in general is an important problem, which will be addressed in the following chapter. In particular, we will argue that one can often avoid cyclic definitions of cind's in a system for finding maximally contained rewritings [15].

[15] Note that it is only reasonable to talk about equivalent rewritings if cind's are allowed to be cyclic.

Chapter 6

Model Management

In the previous chapter, a detailed presentation of the query rewriting problem with conjunctive inclusion dependencies has been given. Such inter-schema constraints are not only highly relevant to data integration because of their ability to deal with concept mismatch that requires symmetric constraints (see Example 1.3.1). This class of constraints also supports the construction of mappings that are robust with respect to change.

This chapter starts with the definition of a very simple model for managing relational schemata and mappings based on cind's in a repository. Schemata are simply sets of relations, without any additional semantics such as those commonly encoded as functional dependencies or inclusion dependencies (e.g., unique keys and foreign key constraints). This restriction, together with our decision to confine this presentation to a purely relational rather than a semantic or object-oriented data model, makes this study a mostly theoretical one. However, it allows us to concentrate on the main issues of supporting maintainability in a concise way. Furthermore, such extensions are reasonably simple to realize [1].

In the second section of this chapter, we discuss the problem of designing mappings that are robust and do not easily become complete failures when the data integration requirements change. Finally, we attack the problem of arriving at collections of inter-schema constraints that are acyclic, a property that Chapter 5 has shown to be desirable.

6.1 Model Management Repositories

A model management repository is a pair ⟨R, M⟩ of a set of relational schemata R and a set of mappings M. We consider schemata as simple sets of relation schemata without dependencies (which could of course be added to the mechanism but which we leave out for simplicity). Each relational predicate

[1] See Section 2.5 on the issue of applying our work on query rewriting to object-oriented queries, and Chapter 7 on the generalization of the query rewriting problem.


1. Add an empty schema R = ∅ to R.

2. Copy the predicates of a schema R ∈ R into a new schema R′. Mappings of R are not linked to R′.

3. Add a predicate to schema R. Predicates are identified by name within a schema and have a unique fixed arity.

4. Rename a predicate p of a schema R, as well as all of its occurrences in mappings.

5. Change the arity of a predicate p in a schema. (a) To add an attribute to p at position i, each of its appearances in cind's is augmented with a new existential variable for that attribute. (b) An attribute may only be removed from p if for every appearance of p in a cind, there is a variable in this attribute position which is existentially quantified and not used in a join.

6. Delete a predicate p from a schema. This requires that p is not used in any mapping.

7. Delete a schema. This is only allowed if no predicate from the schema is used in any mapping.

8. Import a schema (from DDL, IDL, and DTD files, relational databases, spreadsheet layouts, . . . ).

Figure 6.1: Operations on schemata.

1. Add an elementary mapping Σ to M.

2. Add a cind Q1 ⊇ Q2 to an elementary mapping Σ from schemata R1, . . . , Rn to schema R. Q1 must be over predicates in R and Q2 over predicates in R1, . . . , Rn.

3. Remove a cind from an elementary mapping Σ.

4. Delete a mapping. In the case of a composite mapping, all of its constituents are deleted, including auxiliary schemata.

Figure 6.2: Operations on mappings.


1. UNFOLD_M(R, Σ_GAV)
Rewrite a schema R using a set of GAV views Σ_GAV to achieve a finer granularity of the entities contained. For each view

p(X̄) ← p1(X̄1), . . . , pn(X̄n)

in Σ_GAV, let p be a predicate in R and p1, . . . , pn be new predicate names. p is replaced in R by {p1, . . . , pn}, and all subsumer or subsumee queries of cind's in M that contain p are unfolded with the GAV view.

2. MERGE_M(R1, R2, R′)
Merge two schemata R1 and R2 into a new schema R′. This can be done if R1 and R2 do not contain predicates of the same name but with different arities, and if there are no dependencies (via mappings) between any of the predicates in R1 and R2. Predicates from R1 and R2 with the same names fall together. All predicates from R1 and R2 occurring in mappings in M are replaced by the corresponding predicates from R′.

3. SPLIT_{R,M}(R, {R1, . . . , Rm}, {Rm+1, . . . , Rn}, R′)
Distribute the role of a schema R in the data integration infrastructure across R and a new schema R′. Let {R1, . . . , Rm}, {Rm+1, . . . , Rn} be a partition of all schemata against which R is mapped in M (i.e., {R1, . . . , Rm, Rm+1, . . . , Rn} is the set of schemata {X ∈ R | ∃M ∈ M : R ∈ from(M) and to(M) = X}). Copy R to a new schema R′. Copy all the mappings M with to(M) = R and replace all occurrences of predicates in R by their copies in R′. For all mappings in M against schemata Rm+1, . . . , Rn, replace the predicates from R by their copies in R′. This operation is close to being the inverse of the previous merge operation.

4. Eliminate an auxiliary schema R by unfolding the mappings from R with the mappings against R, if all the constraints thereby created are cind's. This condition is guaranteed to hold if all mappings are GAV.

5. COMPOSE_M(A)
Create a composition of (existing) mappings around a (now auxiliary) schema A, as described in the definition of composite mappings.

6. Ungroup a composite mapping. This is needed when an auxiliary schema has matured and is to be (re-)used outside the mapping.

Figure 6.3: Complex model management operations.


is either marked "source" or "logical". A schema is called purely logical if it does not contain source predicates. Relational attributes may be named and typed. If they are unnamed, we refer to them by their index. Relational predicates are unique across all schemata – they are identified by their schema id in combination with their predicate name.

A mapping M maps from a set of schemata R1, . . . , Rn ∈ R (denoted from(M) = {R1, . . . , Rn}) against a single schema R (denoted to(M) = R). We require that to(M) ∉ from(M).

Mappings are either elementary or composite. An elementary mapping Σ is a set of cind's where the subsumer sides of the constraints only use logical predicates from R and the subsumed sides only use predicates from R1 ∪ . . . ∪ Rn. The dependency graph of an elementary mapping thus has a diameter of one. A composite mapping M can be created from a schema A, a mapping M0, and a set of mappings {M1, . . . , Mn} if

1. A ∈ from(M0),

2. A = to(Mi) for each 1 ≤ i ≤ n,

3. A is purely logical, and

4. A is not used in any other mapping in M besides M0, . . . , Mn.

A is called an auxiliary schema. We require the cind's in the union of all elementary mappings of a composite mapping to be acyclic.
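The four conditions for creating a composite mapping can be checked mechanically; a minimal Python sketch, with mappings encoded as dictionaries carrying "from" and "to" fields (an encoding of our own choosing, not the thesis's):

```python
def can_compose(A, M0, inner, all_mappings, purely_logical):
    """Check the four conditions for forming a composite mapping around
    auxiliary schema A: (1) A is in from(M0); (2) every inner mapping
    maps to A; (3) A is purely logical; (4) A is not used by any other
    mapping.  `purely_logical` tests for the absence of source
    predicates."""
    if A not in M0["from"]:
        return False
    if any(M["to"] != A for M in inner):
        return False
    if not purely_logical(A):
        return False
    involved = [M0] + list(inner)
    for M in all_mappings:
        if M in involved:
            continue
        if A == M["to"] or A in M["from"]:
            return False
    return True
```

The acyclicity requirement on the union of the elementary mappings would be checked separately on the dependency graph of the cind's.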

We do not provide an exhaustive list of all imaginable model management operations. Figure 6.1 and Figure 6.2 list operations for the manipulation of schemata and mappings, respectively. Figure 6.3 shows some of the more interesting complex operations. Model management software can be very useful in supporting a human expert in these manipulation tasks. For instance, when a new attribute is added to a relation, all of its occurrences in cind's can be automatically expanded with a new existentially quantified variable.
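This automatic expansion (operation 5(a) of Figure 6.1) is straightforward; a Python sketch, where atoms are (predicate, argument-tuple) pairs and variables are strings, an encoding of our own choosing:

```python
import itertools

_fresh = itertools.count()

def add_attribute(atoms, predicate, position):
    """Pad every occurrence of `predicate` in a list of (pred, args)
    atoms with a fresh, existentially quantified variable at the given
    position, leaving all other atoms untouched."""
    out = []
    for pred, args in atoms:
        if pred == predicate:
            args = args[:position] + ("e%d" % next(_fresh),) + args[position:]
        out.append((pred, args))
    return out
```

Applied to all cind's of a repository, this realizes the expansion described above without human intervention; the converse removal (operation 5(b)) would additionally have to verify that the dropped variable is existential and not used in a join.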

6.2 Managing the Change of Schemata and Requirements

The lack of a global schema against which sources can be integrated leads to the problem that the number of mappings grows with the square of the number of schemata. Together with the prospect that schemata may evolve, this leads to a serious management and maintenance problem.

We approach this problem using two principal techniques: the decoupling of dependencies between mappings with respect to change (Section 6.2.1)


and the merging and clustering of design artifacts of the data integration architecture wherever possible to reduce redundancy and the number of such artifacts to be managed (Section 6.2.2).

6.2.1 Decoupling Mappings

Given a number of schemata and mappings expressing dependencies between them, a risk exists that some minor modification to a mapping (which may be complex and work-intensive to design) renders its complete redesign necessary. Similarly, a change to a schema or to the data integration requirements regarding a schema may invalidate several mappings.

Two goals are immediate consequences of this:

• Firstly, changes to a set of views should remain as local as possible. Whenever a source is added, we only want to add a single logical view (or a few), but hopefully do not have to carry out a major redesign of mappings. It has been observed that local-as-view integration supports the simple addition and removal of source mappings (see Section 3.9).

• Secondly, mappings should decouple source and integration schemata from each other in the sense that a change to an integration schema (i.e., schema evolution) or to its data integration requirements has a minor impact on the "other end" of a mapping, the part of the description that is responsible for integrating sources.

Composite mappings as defined in the previous section permit the design of layers of inter-schema constraints that may be attributed different roles. The high expressiveness of our query rewriting formalism allows such layers, for instance, to be either sets of LAV or GAV views. The resulting design potential enables us to create mappings that make intuitions regarding likely future change explicit and to prepare for this change. We can attribute dedicated integration roles to individual layers, as shown in the following example.

Example 6.2.1 Let there be a fixed integration schema R with a single relation R.book, against which we would like to integrate four sources S1, S2, S3, S4 and five source relations S1.book, S2.book, S3.book, S4.sales, and S4.categories.

We define a composite mapping between R and its sources that consists of three layers (created using operation 5 of Figure 6.3 twice) and two auxiliary schemata, A1 and A2. We use three auxiliary predicates, A1.book, A1.sales, and A2.s′4. The outermost, a GAV mapping from S4 to A2, takes over the task of pre-filtering sources. The middle mapping from A2 to A1 follows the local-as-view approach and takes over the main source integration role. The innermost mapping, again GAV, projects from our well-designed auxiliary schema A1 to R. Consider the following constraints (see also Figure 6.4).


[Figure 6.4 diagram: sources S1, S2, S3 and auxiliary schema A2 map via LAV (M2) to A1; S4 maps via GAV (M3) to A2; A1 maps via GAV (M1) to R.]

Figure 6.4: Data integration infrastructure of Example 6.2.1. Schemata are visualized as circles and elementary mappings as arrows.

M3: "Pre-filtering" (GAV); from(M3) = {S4}, to(M3) = A2

A2.s′4(Name, Producer, Price, Sales, Units) ←
    S4.sales(CategoryId, Name, Producer, Price, Sales, Units),
    S4.categories(CategoryId, "Books").

M2: "Source integration" (LAV); from(M2) = {S1, S2, S3, A2}, to(M2) = A1

{⟨Isbn, Name, Author⟩ | S1.book(Isbn, Name, Author)} ⊆
    {⟨Isbn, Name, Author⟩ | ∃Price, Publisher :
        A1.book(Isbn, Name, Author, Price, Publisher)}

{⟨Isbn, Name, Publisher⟩ | S2.book(Isbn, Name, Publisher)} ⊆
    {⟨Isbn, Name, Publisher⟩ | ∃Author, Price :
        A1.book(Isbn, Name, Author, Price, Publisher)}

{⟨Name, Author, Sales, Units⟩ | S3.book(Name, Author, Sales, Units)} ⊆
    {⟨Name, Author, Sales, Units⟩ | ∃Isbn, Price, Publisher :
        A1.book(Isbn, Name, Author, Price, Publisher),
        A1.sales(Isbn, Sales, Units)}

{⟨Name, Publisher, Price, Sales, Units⟩ |
        A2.s′4(Name, Publisher, Price, Sales, Units)} ⊆
    {⟨Name, Publisher, Price, Sales, Units⟩ | ∃Isbn, Author :
        A1.book(Isbn, Name, Author, Price, Publisher),
        A1.sales(Isbn, Sales, Units)}

M1: "Customizing" (GAV); from(M1) = {A1}, to(M1) = R

R.book(Name, Author, Price, Publisher) ←
    A1.book(Isbn, Name, Author, Price, Publisher).


We have created the GAV view A2.s′4 assuming that CategoryId is only used in that source, and have anticipated that no other future sources will provide it, making it easier to leave the schema against which the LAV views are mapped unchanged. On the other hand, ISBN codes are or will be provided by several sources and are relevant to integration, although our legacy integration schema does not know them. As a consequence, we have created an auxiliary integration schema and provide a GAV mapping between the auxiliary and the legacy integration schema. We have also added a "sales" predicate to it, assuming that many sources will provide sales information and that our action will save us from creating many GAV views that project these attributes out.

Example 6.2.1 has used a three-layer (GAV-LAV-GAV) integration strategy, where the dedicated roles of (1) customizing, (2) source integration, and (3) pre-filtering were assigned to the three layers M1, M2, and M3. The LAV layer M2 assumes the role of taking over most of the source integration. If sources have to be integrated against an information system whose schema lacks properties necessary for LAV integration, the LAV layer integrates against an auxiliary schema (schema A1 in Example 6.2.1) that extends the integration schema by these properties. The first (GAV) layer M1 maps the predicates of the auxiliary schema against the (legacy) integration schema. The third layer M3 may be used to filter out data or project out attributes that are irrelevant for the integration purpose at hand, such that the (auxiliary) integration schema, and with it the LAV views, do not have to be changed more often than absolutely necessary.

Intuitively, this strategy should allow for convenient and maintainable data integration in a large number of scenarios. The LAV layer provides locality of change when sources are added (or deleted), and the entirety of these three layers facilitates decoupling when an integration schema changes. Changes to an integration schema can often be absorbed by the pre-filtering GAV views of mappings from such a schema and the customizing GAV views of mappings against such a schema. Thus, changes to the data integration infrastructure usually remain local and reasonably simple to manage.

Adding Sources

This motivates the following steps for adding sources [2] (see Figure 6.5 for the development stages of the set of views of a given legacy integration schema):

• Initially, we attempt to use LAV to integrate the sources against the integration schema.

[2] Of course, the rules given here should be followed less strictly if the designer of mappings anticipates some future change and designs a more sophisticated auxiliary integration schema that deviates more from the legacy integration schema.

[Figure 6.5 diagram: stages in the development of the view layers of a legacy integration schema, combining LAV source integration with customizing and pre-filtering GAV layers over an extended ("Integration+") auxiliary schema.]
Figure 6.5: The lifecycle of the mappings of a legacy integration schema.

• If there are source attributes that do not exist in the integration schema [3], make a choice depending on whether these source attributes are likely to occur in many other sources or not. If the answer is yes, copy the predicates [4] to which they would most naturally be added, and add the attributes. Use the altered auxiliary schema for LAV while at the same time providing GAV views from the altered predicates to the original versions in the integration schema, essentially just projecting out the added attributes. This is a nonlocal change. However, all the logical views which have been there before and use changed predicates can be altered automatically (a simple dummy attribute has to be introduced at the right position). Otherwise, add a GAV view before the LAV stage that projects out these attributes. Auxiliary schemata for LAV integration can also be generalized using the UNFOLD operator of Figure 6.3.

• If some pre-filtering of the data available through sources (see Example 6.2.1) is needed, decide whether the predicates of other future sources are likely to be more general than the current schema against which LAV integration is carried out, in similar ways. If so, generalize the auxiliary integration schema (if LAV integration is carried out against the legacy schema, copy it first) and provide proper GAV views. Otherwise, add a GAV view between the source and the LAV views.

Of course, there is a varying degree of intuition that can be put into auxiliary integration schemata in order to facilitate future maintenance. On the parsimonious side, auxiliary integration schemata are only changed when this is really needed. At the other extreme, we may attempt to design a kind of "global"

[3] For instance, this is the case for the Isbn attribute of several sources in Example 6.2.1.

[4] That is, create an auxiliary integration schema that is equal to the integration schema apart from a number of predicates that are adapted to be able to map the sources in question.

[Figure 6.6 diagram: two auxiliary schemata AUX1 and AUX2, each mapped via GAV against its integration schema (IM) and via LAV against the sources, are merged into a single shared schema AUX1+2.]
Figure 6.6: Merging auxiliary integration schemata to improve maintenance.

integration schema, allowing the source integration of several similar information systems that subscribe to similar sources to be combined. This is discussed in more detail in the following section.

6.2.2 Merging Schemata

The second main technique for simplifying the management of schemata and mappings in our architecture is based on the attempt to merge (auxiliary) schemata in the tradition of [BLN86, KDB98] (using the MERGE operation of Figure 6.3) whenever possible, or even to develop global schemata of limited scope [5] that are well designed and prepared for the kinds of future change that are likely to occur.

Reusing Auxiliary Schemata

It may be reasonable to use the predicates of an information system's auxiliary integration schema, rather than its legacy schema, as sources to yet another information system. This is particularly appropriate if the intuitively perceived quality of the former is much higher than that of the latter. Another reason may be that the GAV views mapping the auxiliary integration schema against the legacy integration schema filter out relevant data.

This leads us to the possibility of reusing auxiliary integration schemata, which may eliminate redundant work and greatly simplify the maintenance task. Such a step may be justified if several information systems have similar integration requirements (need similar information from sources) and if the adjustments that will be needed when sources change are expected to correlate heavily. If this is the case, auxiliary integration schemata can be merged into one. The schema merging task can, for example, be carried out by defining a suitable "more global" auxiliary schema for the given auxiliary integration schemata, defining appropriate GAV views to map the predicates of the old schemata against the new one, and then generalizing these schemata and their mappings by unfolding (using the UNFOLD operation of Figure 6.3).
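The unfolding step in this procedure — replacing occurrences of a GAV view's head predicate by its body — can be sketched as follows (a Python sketch; the (predicate, argument-tuple) atom encoding and the variable naming are our own assumptions made for illustration):

```python
import itertools

_fresh = itertools.count()

def unfold(query, view_head, view_body):
    """Replace each subgoal over the view's head predicate by the view
    body, substituting the head variables by the subgoal's arguments
    and renaming the view's existential variables apart.  Queries are
    lists of (predicate, args) atoms; variables are strings."""
    head_pred, head_args = view_head
    result = []
    for pred, args in query:
        if pred != head_pred:
            result.append((pred, args))
            continue
        # head variables map to the subgoal's arguments; any other
        # (existential) view variable gets a fresh name on first use
        subst = dict(zip(head_args, args))
        for bpred, bargs in view_body:
            new_args = tuple(
                subst.setdefault(v, "v%d" % next(_fresh)) for v in bargs
            )
            result.append((bpred, new_args))
    return result
```

Because the substitution is shared across all body atoms of one occurrence, an existential view variable that joins two body atoms is renamed consistently.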

[5] These are similar to export schemata in federated databases [SL90].

[Figure 6.7 diagram: several integration schemata (IS) share one clustered auxiliary schema (AUX), which is mapped against sources Src1, Src2, Src3.]

Figure 6.7: A clustered auxiliary schema. Schemata are displayed as circles and mappings as arrows.

Clustering

For instance, consider again the case of the LHC project (see Section 1.3). There are groups of information systems that, although they are based on different schemata, satisfy similar needs (are in the same stage of the project lifecycle) for different subprojects. For such clusters, it may be wise to create a "global" information system or data warehouse (from which the individual information systems basically receive their data through a simple GAV mapping) whose aim is restricted to that particular step of the lifecycle (as noted, building a global schema for the whole lifecycle may not be possible), and which concentrates source integration against its global schema.

Figure 6.7 depicts such a shared auxiliary schema. Even if data integration is carried out on demand (i.e., using the "lazy approach" to data integration [Wid96]), one can think of such an approach as an analogy to data warehouses (the clustered schemata) and data marts (the individual integration schemata).

The SPLIT operation of Figure 6.3 allows clustering decisions to be taken back if integration schemata making use of such "global" schemata evolve in different ways and the clusters become unsustainable.

The creation of a "global" auxiliary integration schema for several similar information systems also simplifies the task of avoiding circularities in definitions of constraints caused by information systems mutually using each other's virtual predicates.

6.3 Managing the Acyclicity of Constraints

It is clearly a goal to have the set of all cind's in a data integration system be acyclic, as that property guarantees the computability of rewritings. Cyclic sets of cind's amount to a self-referential definition of the source-to-integration predicate relationships. Rewritings produced using the results of Chapter 5 may in theory be of infinite size.

One could give up the completeness requirement and produce rewritings that are guaranteed to be sound but may be incomplete, simply by setting a threshold on processing time or on the number of constraints used. Our intuition is that in practice, when real-world constraints for data integration are encoded, the rewriting process will terminate with a complete result in most cases. Alternatively, the query rewriting tool could, given a query and with some justification, cut away, e.g., those cind's that occur in a cycle and whose directed edges in the dependency graph are most distant from the predicates in the query.

If the process of designing mappings between schemata is computer-supported, a system could help to avoid such situations. Acyclicity can be enforced automatically throughout the design process of mappings and should not be perceived as too restrictive in that case.
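Such automatic enforcement amounts to an incremental cycle check on the dependency graph of the cind's; a Python sketch (the graph encoding and the edge direction — from each subsumer predicate to each subsumed predicate — are our own simplification):

```python
def stays_acyclic(edges, new_cind):
    """Check whether adding a cind keeps the dependency graph acyclic.
    `edges` maps a predicate to the set of predicates it depends on;
    the new cind is a (subsumer_predicates, subsumed_predicates) pair.
    We add its edges to a copy of the graph and search for a cycle by
    iterative depth-first search with three-colour marking."""
    subsumer, subsumed = new_cind
    trial = {p: set(qs) for p, qs in edges.items()}
    for p in subsumer:
        trial.setdefault(p, set()).update(subsumed)
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {}
    for start in list(trial):
        if colour.get(start, WHITE) != WHITE:
            continue
        colour[start] = GREY
        stack = [(start, iter(trial.get(start, ())))]
        while stack:
            node, it = stack[-1]
            nxt = next(it, None)
            if nxt is None:
                colour[node] = BLACK   # all successors explored
                stack.pop()
            elif colour.get(nxt, WHITE) == GREY:
                return False           # back edge: a cycle would arise
            elif colour.get(nxt, WHITE) == WHITE:
                colour[nxt] = GREY
                stack.append((nxt, iter(trial.get(nxt, ()))))
    return True
```

A mapping design tool could call such a check on every "add cind" operation (Figure 6.2) and reject constraints that would close a cycle.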

The clustering of auxiliary schemata combining logical predicates that represent integrated sources and which are to be connected to several "subscriber" schemata clearly supports the goal of escaping cyclicity. In the extreme case, one could aim at defining auxiliary schemata that are commonly used by all information systems requiring access to certain resources, while making sure that none of the mappings against these resources share any of the logical predicates used in the earlier mappings.


Chapter 7

Outlook

This chapter first presents the problem of providing physical data independence under schema evolution in Section 7.1. This is another realistic application of query rewriting with cind's, outside of data integration. It is a straightforward generalization of the problem of maintaining physical data independence, analogous to the transition from data integration via the problem of answering queries using views to data integration by query rewriting with cind's.

In the remainder of the chapter, we discuss extensions of query rewriting with cind's (which has so far only been considered in the context of relational conjunctive queries) that are analogous to those that have been proposed for the problem of answering queries using views. A few issues worth considering are:
• Recursive queries. We address the query rewriting problem with recursive (datalog) queries and nonrecursive sets of cind's in Section 7.2. This problem can be solved easily as a generalization of the work in [DG97].

• Sources with binding patterns within the data integration architecture presented in Chapter 4 are relevant for two reasons. Firstly, this feature may be required for the integration of sources with restricted query interfaces, such as legacy systems.

Secondly, binding patterns allow us to include procedural code for transforming data. This may provide a gateway between different approaches to data integration that may coexist in a heterogeneous data integration infrastructure. Another application may be procedures that implement complex data transformations.

Of course, it has been observed that most practical database queries are of a very simple nature, and that very restricted query languages (with their favorable theoretical properties) cover most practical needs, particularly those of non-expert users. This, however, does not always remain true. Certain classes of queries that are needed in the real world (particularly in engineering environments such as in our use case of Section 1.3) are sufficiently hard that they cannot be carried out using the query language supported by the data integration platform and the underlying reasoning method.


[Figure 7.1 diagram: panel (A) shows a cind between an integration schema and a "source" schema; panel (B) shows a procedure whose interface (with a binding pattern) is exported to an interface schema, with cross-constraints between that interface and the "source" schema whose relations the procedure accesses.]

Figure 7.1: A cind as an inter-schema constraint (A) compared to a data transformation procedure (B). Horizontal lines depict schemata and small circles depict schema entities. Mappings are shown as thin arrows.

The solution to this is to encapsulate advanced data transformations in a “procedure”, that is, a construct that, for the purposes of data integration and query rewriting, is only described externally, by its interface. The procedure itself may contain a query in a highly expressive query language or a piece of code in a high-level programming language.

The tradeoff made is the following: query rewriting reasoning is simplified, and often only made possible in the first place, and certain complicated queries may be hard-wired in efficient, problem-specific code. On the downside, the completeness of rewriting compared to queries that are more than externally described is lost when procedures are used.

If such data transformation procedures are embedded in the data integration architecture in the sense that they also read out (possibly integrated) data from information systems inside the infrastructure, one may describe constraints that hold between the interfaces and the schemata of the accessed data (see Figure 7.1), using e.g. a description logics formalism such as that of [BD99]. Constraints of this kind could be used to bound the query rewriting process and eliminate irrelevant rewritings. Such a hybrid approach of query rewriting and description logics reasoning would be highly interesting, though necessarily incomplete.

The query rewriting problem with binding patterns in the case of acyclic

sets of cind’s can be reduced to the problem addressed in [DGL00] by the

transformation described in Section 7.2.
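To make the idea of an externally described procedure concrete, the following Python sketch models one as an opaque callable whose only published description is its binding pattern ("b" = must be bound, "f" = free). The class, names, and data here are illustrative inventions, not part of the thesis or of any library:

```python
# Hypothetical sketch: a "procedure" source that, for rewriting purposes,
# is described only by its binding pattern (b = bound, f = free).
class ProcedureSource:
    def __init__(self, name, pattern, fn):
        self.name = name
        self.pattern = pattern  # e.g. "bf": first argument must be bound
        self.fn = fn            # opaque procedural code behind the interface

    def call(self, args):
        # Enforce the binding pattern: every 'b' position needs a value.
        for p, a in zip(self.pattern, args):
            if p == "b" and a is None:
                raise ValueError(f"{self.name}: bound argument missing")
        # Pass only the bound arguments to the hidden procedure.
        return self.fn(*[a for p, a in zip(self.pattern, args) if p == "b"])

# A legacy-style lookup: given a course id (bound), return its name (free).
catalog = {"cs101": "Databases"}
lookup = ProcedureSource("course_by_id", "bf",
                         lambda cid: [(cid, catalog.get(cid))])

print(lookup.call(["cs101", None]))  # legal: the bound position is supplied
```

A rewriter would treat `lookup` exactly like a source relation with an access restriction: it may only appear in a rewriting where its first argument is already bound by another subgoal.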

• Object-oriented and semistructured schemata and queries. We have discussed the equivalence of (the range-restricted versions of) nested relation calculus and relational calculus in Section 2.5. Given this, the rewriting of conjunctive nested relation calculus queries and of analogous constraints can be simulated in the relational case by a simple syntactic transformation (see e.g. Example 2.5.1). This covers a practically relevant class of queries in the complex object model that can be mapped straightforwardly to object-oriented data models (see also [LS97]).

Semistructured data models (e.g. OEM [AQM+97] or ACeDB [BDHS96]) have recently received much interest due to the vision of considering the World Wide Web as a single large database [AV97a, FLM98], and to the rise of XML-related technologies as a major standard for data exchange [ABS00]. The semistructured case can to a certain extent be seen as a special case of the object-oriented one. However, a special case of recursive queries – regular path queries – is an important aspect of semistructured database queries [CM90, Abi97, AV97b]. We address the rewriting of recursive queries under cind's in Section 7.2, as mentioned. For local-as-view integration in the semistructured context, particularly with regular path views, see [PV99, CDLV99, CDLV00b, CDLV00a].
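The regular path queries mentioned above can be evaluated as a product construction between the graph and a finite automaton. The following sketch is my own minimal illustration (graph, automaton, and names are invented), evaluating the expression a·b* by breadth-first search:

```python
# Illustrative sketch: evaluating the regular path query a·b* over a
# labeled graph by BFS on the product of graph nodes and automaton states.
from collections import deque

edges = {("u", "a", "v"), ("v", "b", "w"), ("w", "b", "x")}

# Hand-built automaton for a·b*: state 0 --a--> 1, state 1 --b--> 1.
delta = {(0, "a"): 1, (1, "b"): 1}
accepting = {1}

def rpq(start):
    """Nodes reachable from `start` along a path whose label word matches a·b*."""
    seen = {(start, 0)}
    result = set()
    queue = deque([(start, 0)])
    while queue:
        node, state = queue.popleft()
        if state in accepting:
            result.add(node)
        for (src, label, dst) in edges:
            if src == node and (state, label) in delta:
                nxt = (dst, delta[(state, label)])
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return result

print(sorted(rpq("u")))  # nodes reached via label words a, ab, abb, ...
```

The recursion inherent in the Kleene star is what places regular path queries inside the recursive (datalog) fragment discussed in Section 7.2.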

• Conjunctive queries with inequalities. Although practically relevant, this issue is left open for future research. A special case is discussed in Footnote 3 in Section 7.1.

Query rewriting with cind’s and functional dependencies is another topic

of future research.

7.1 Physical Data Independence under Schema Evolution

7.1.1 The Classical Problem

Database systems are based on the assumption of a separation between a logical schema and a physical storage layout, which represents an important factor in their popularity. In practice, however, this independence between the logical and the physical schema is not really given in state-of-the-art database systems. This is at least true for relational database systems, where relations are usually stored directly as files that are quite straightforward serializations of the data under the logical schema. For object-oriented schemata, the physical and logical schemata do not coincide as closely in practice; otherwise, there would be too much redundancy. Still, there is usually a fixed canonical relationship between physical and logical schemata.

True physical data independence would be worthwhile, as it would allow one to define a logical schema according to design and application requirements and

Figure 7.2: EER diagram of the university domain (initial version).

a physical schema optimized for performance. Currently, the coupling between physical and logical schemata does not permit this, forcing designers to depart from schemata that follow domain conceptualizations in order to attain satisfactory performance.

Work on improving this situation (in particular, GMAP [TSI94]) has defined

physical storage structures as materialized views over the logical schema. That

way, answering queries requires local-as-view query rewriting (which is not harder

than NP-complete in the size of the query [LMSS95]), and the database update

problem is comparatively simple (it is the view maintenance problem [AHV95], concerned with propagating changes to base tables incrementally to views, such that views do not need to be fully refreshed whenever a change occurs). This task

would be substantially more complicated if the relationship between the logical

schema and the physical storage structures were defined the other way round,

i.e., the logical relations as views over the physical. In that case, the view update

problem [BS81, FC85] would have to be solved. The approach of [TSI94] also makes it possible to improve performance for classes of similar queries that are asked frequently, simply by adding further storage structures that are defined as views similar to those queries.
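The view maintenance task mentioned above can be sketched in a few lines: an insertion into a base table is propagated by joining only the inserted delta against the other base table, never recomputing the view from scratch. Relation names and data below are my own invented toy instance:

```python
# Minimal sketch of incremental maintenance for a single-join materialized
# view, in the spirit of the view maintenance problem [AHV95].
works_in = {("ann", "d1")}
department = {("d1", "CS"), ("d2", "EE")}

# Materialized view: view(fac, dept_id, dept_name) = works_in ⋈ department
view = {(f, d, n) for (f, d) in works_in for (d2, n) in department if d == d2}

def insert_works_in(tup):
    """Propagate an insertion into works_in to the view incrementally."""
    works_in.add(tup)
    f, d = tup
    # Join only the delta tuple with department, not the whole base table.
    delta = {(f, d, n) for (d2, n) in department if d2 == d}
    view.update(delta)
    return delta

insert_works_in(("bob", "d2"))
print(sorted(view))
```

Because the storage structures are views over the logical schema, this delta propagation is all that a base-table update requires; the harder view update problem only arises in the reversed setting described above.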

Example 7.1.1 We use the popular university domain that has been previously used to communicate the essentials of the maintenance of physical data independence [TSI94, Lev00]. Consider the logical schema of Figure 7.2¹. This translates into the following relational schema. Primary key attributes are underlined.

v1.student(StudId, Name)
v1.masters_student(StudId, SecondPeriod)
v1.phd_student(StudId, ResearchArea, Advisor)
v1.professor(Name, Leads_DeptId)
v1.faculty(Name)
v1.course(CourseId, Name, RequiredExam_CourseId, CurriculumName)
v1.teaches(Name, CourseId)
v1.exam_taken(StudId, CourseId, Date, Grade)
v1.department(DeptId, Name, Address)
v1.works_in(FacName, DeptId)

All students are either masters or PhD students. Full professors are managed separately from other faculty (e.g. research or teaching assistants). Each professor leads a department. Faculty may work in several departments. Full names of professors and other faculty are assumed to be unique in the combined domain of such names². Courses are taught by professors or other faculty, have an id number, and may require up to one other course whose exam students must have passed successfully in order to be admitted. If a course has no such requirement, a NULL value is stored for the attribute RequiredExam_CourseId rather than a course id. PhD students have a professor as their advisor and an assigned area of research. Masters students are either in their first or second period of their studies, and this state is stored as a flag second_period.

Let us now assume the following physical storage structures, which are defined as views over the logical schema.

m1(StudId, StudName, Area, Advisor, DeptId, DeptName, DeptAddress) ←
    v1.student(StudId, StudName),
    v1.phd_student(StudId, Area, Advisor),
    v1.works_in(StudName, DeptId),
    v1.department(DeptId, DeptName, DeptAddress).

m2(Name, LeadsDeptId, DeptName, DeptAddress) ←
    v1.professor(Name, LeadsDeptId),
    v1.department(LeadsDeptId, DeptName, DeptAddress).

m3(StudId, StudName, CourseId) ←
    v1.student(StudId, StudName),
    v1.course(CourseId, CourseName, Req, Curriculum),
    v1.exam_taken(StudId, Req, Date, Grade).

¹ The schema is presented as an Extended Entity Relationship (EER) diagram [TYF86, Che76], i.e., with is-a relationships, which are drawn as arrows with white triangular heads.
² We intentionally outline a less-than-perfect schema.

Now consider the following query, which asks for the names of PhD students who work (e.g. as teaching assistants) in departments not led by their advisors³.

q(StudName) ← v1.professor(AdvisorName, LDeptId),
    v1.department(LDeptId, LDeptName, LDeptAddress),
    v1.student(StudId, StudName),
    v1.phd_student(StudId, Area, AdvisorName),
    v1.works_in(StudName, SDeptId),
    v1.department(SDeptId, SDeptName, SDeptAddress),
    LDeptAddress ≠ SDeptAddress.

In this context, materialized views – the physical storage structures – are assumed complete and up-to-date. Thus, view m2, for instance, has the meaning
{⟨Name, LeadsDeptId, DeptName, DeptAddress⟩ |
    m2(Name, LeadsDeptId, DeptName, DeptAddress)} ≡
{⟨Name, LeadsDeptId, DeptName, DeptAddress⟩ |
    v1.professor(Name, LeadsDeptId) ∧
    v1.department(LeadsDeptId, DeptName, DeptAddress)}

By solving the problem of answering queries using views, the following equivalent rewriting of the input query can be found, in which all predicates of the logical schema have been replaced by materialized views.

³ This query contains an inequality. Since the constraints (views) do not contain any inequalities, the query may be decomposed into

q(StudName) ← q′(StudName, LDeptAddress, SDeptAddress),
    LDeptAddress ≠ SDeptAddress.

q′(StudName, LDeptAddress, SDeptAddress) ← v1.professor(AdvisorName, LDeptId),
    v1.department(LDeptId, LDeptName, LDeptAddress),
    v1.student(StudId, StudName),
    v1.phd_student(StudId, Area, AdvisorName),
    v1.works_in(StudName, SDeptId),
    v1.department(SDeptId, SDeptName, SDeptAddress).

Thus, our algorithms from Chapter 5 are sufficient for finding maximally contained rewritings of conjunctive queries with inequalities under sets of cind's without inequalities. Of course, the compositionality of conjunctive queries is preserved when inequalities are introduced. Thus, the rewriting of q′ can be unfolded with q to obtain a maximally contained positive rewriting with inequalities.


q(StudName) ←
    m1(SId, StudName, A, ProfName, SDId, SDName, SDeptAddress),
    m2(ProfName, LDId, LDName, LDeptAddress),
    LDeptAddress ≠ SDeptAddress.

Note that the physical storage structures m1, m2, and m3 are not sufficient to fully cover the logical schema: for instance, faculty other than PhD students are not represented, and “teaches” relationships are not stored anywhere. Thus, additional physical structures would be needed in practice.
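The equivalent rewriting above can be executed directly against the materialized views. The following sketch runs it over a toy instance of m1 and m2; the data values are invented for illustration only:

```python
# Illustrative sketch: executing the rewritten query over the materialized
# views m1 and m2, including the inequality on department addresses.

# m1(StudId, StudName, Area, Advisor, DeptId, DeptName, DeptAddress)
m1 = {(1, "eve", "db", "smith", "d2", "EE", "Hill Rd"),
      (2, "dan", "ir", "jones", "d2", "EE", "Hill Rd")}
# m2(Name, LeadsDeptId, DeptName, DeptAddress)
m2 = {("smith", "d1", "CS", "Main St"),
      ("jones", "d2", "EE", "Hill Rd")}

# q(StudName) <- m1(..., ProfName, ..., SAddr), m2(ProfName, ..., LAddr),
#                LAddr != SAddr.
q = {sname
     for (_sid, sname, _a, prof, _sd, _sdn, saddr) in m1
     for (prof2, _ld, _ldn, laddr) in m2
     if prof == prof2 and laddr != saddr}

print(sorted(q))  # PhD students working in a department their advisor does not lead
```

Here "eve" qualifies because her advisor leads a department at a different address, while "dan" works in the very department his advisor leads.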

It is easy to see that the problem of providing physical data independence is of wide practical importance. Note that in [TSI94], each physical storage structure is indexed over either a relation attribute or a row id (as a relational equivalent of object identifiers; since the work is presented in the light of a semantic data model, however, the term object id is used as is). The query rewriting problem thus becomes the problem of answering queries using views with binding patterns, as discussed in Section 3.6. Binding patterns, however, are considered in a weak form: if no rewriting can be produced, binding patterns are ignored (equivalent to ignoring an index and scanning the whole relation or materialized view).

7.1.2 Versions of Logical Schemata

Let us now assume that logical schemata may evolve. For several reasons, it may be desirable not to rebuild storage structures each time schemata evolve.

• Physical storage structures (currently) need to be designed manually for optimizing performance⁴. This requires expert work, which is often not justified for a minor schema change that does not greatly affect the appropriateness of the current physical storage structures.

• Materialized views may be very large yet accessed rarely, so that the cost of rebuilding physical structures relative to the cost of accessing them must not be assumed to be zero. This is, for instance, the case in very large (terabyte or petabyte) scientific repositories that are written only once – to tertiary storage, e.g. tape robots – and whose individual data records are subsequently accessed only sparingly. In that case it is worthwhile to leave physical structures unchanged whenever possible and to define new versions of logical schemata relative to existing logical schema versions as well as to the physical structures.

• Sometimes, data in physical storage structures must not be lost even when new logical schema versions no longer make use of them. Reasons for that

⁴ According to [Lev00], this is an important area of future database research.

Figure 7.3: EER diagram of the university domain (second version).

may be that a database may still be addressed under the old logical schema by certain applications, or that there is reason to expect future schema versions to make use of these data again.

• Physical storage structures may be read-only or replicas of databases that are offline (e.g. in mobile, distributed applications).

Given that concepts in different schema versions may experience a true shift of meaning (concept mismatch), cind's present themselves as appropriate for encoding such inter-schema dependencies. We next give an example showing why query rewriting with cind's may be relevant in this context. A number of serious problems are left open, however; they are briefly summarized after this example, at the end of the section. The main assumption that we make is that queries over the logical schema may be translated into maximally contained positive queries (rather than equivalent conjunctive queries⁵) over the storage structures.

Example 7.1.2 Let us now define the following alterations to the logical schema v1 of Example 7.1.1. Professors are now members of the faculty. The university changes from a pure graduate school to one that also accommodates undergraduate students. Both masters and PhD students are replaced by a new category, graduate students. The two periods of masters studies cease to exist, but there is a new field, “major”, for undergraduates. PhD research areas are represented by a logical relation research_interest, which is also used for managing the research areas of faculty. There is a new relation phd_program, which has its own key, referenced by a new advises relationship with a professor. Not every professor leads a department anymore, so there is a new relation leads. The schema is again shown as an EER diagram in Figure 7.3.

⁵ Note that a query equivalent to a conjunctive query under a set of cind's must itself be a conjunctive query.

v2.student(StudId, Name)
v2.undergraduate_student(StudId, Major)
v2.graduate_student(StudId)
v2.phd_program(Id, StudId)
v2.research_interest(Name, Area)
v2.advises(Advisor, PhdProgramId)
v2.faculty(Name)
v2.professor(Name)
v2.course(CourseId, Name, RequiredExam_CourseId, CurriculumName)
v2.teaches(Name, CourseId)
v2.exam_taken(StudId, CourseId, Date, Grade)
v2.department(DeptId, Name, Address)
v2.leads(ProfName, DeptId)
v2.works_in(FacName, DeptId)

We define the following cind's and leave the cind's that map predicates whose meanings do not change from v1 to v2 as a (very) simple exercise for the reader.

{⟨StudId, StudName, ProfName, Area⟩ | ∃PhDProgramId :
    v2.student(StudId, StudName) ∧ v2.graduate_student(StudId) ∧
    v2.research_interest(StudName, Area) ∧
    v2.advises(ProfName, PhDProgramId) ∧
    v2.phd_program(PhDProgramId, StudId)} ⊇
{⟨StudId, StudName, Advisor, Area⟩ |
    v1.phd_student(StudId, Area, Advisor) ∧ v1.student(StudId, StudName)}

{⟨Name, DeptId⟩ | v2.professor(Name) ∧ v2.faculty(Name) ∧
    v2.works_in(Name, DeptId) ∧ v2.leads(Name, DeptId)} ⊇
{⟨Name, Leads_DeptId⟩ | v1.professor(Name, Leads_DeptId)}

v2.graduate_student(StudId) ← v1.masters_student(StudId, SecondPeriod).

With this second version of the logical schema, it is also necessary to define additional physical storage structures to accommodate new data such as undergraduate majors:

m4(StudId, StudName, Major) ←
    v2.student(StudId, StudName),
    v2.undergraduate_student(StudId, Major).

A subsequent third version of the logical schema could be defined using cind's that relate to predicates of the previous versions as well as to the physical storage structures.

As mentioned, we have left a number of important aspects of the problem of maintaining physical data independence under schema evolution out of consideration. In the context of this problem, query rewriting usually aims at producing equivalent rather than maximally contained rewritings. If no equivalent rewriting can be found, no rewriting at all is produced. Rewritings over physical storage structures are usually assumed to return the same results as the original queries over the logical schema. The problem of finding equivalent rewritings over cind's, however, entails cyclic sets of such constraints, for which we know that neither maximally contained nor equivalent rewritings can be computed in general.

There are two pragmatic solutions to this problem, apart from the obvious one of searching for an equivalent rewriting up to a time or memory consumption threshold. Firstly, one could define maximal containment as the “correct” semantics. That way, results will be complete in the case that an equivalent rewriting exists, and logically still justified otherwise⁶.

Alternatively, one could first compute the maximally contained rewriting of a query (over an acyclic set of cind's composed of containment rather than equivalence constraints) and then reverse the containment relationships in the cind's and test whether any of the conjunctive queries in the maximally contained rewriting contains the input query. This would be a sound but theoretically incomplete approach to producing equivalent rewritings. In practice, however, it would probably coincide well with users' expectations. Note that this requires that each cind in the constraints base individually expresses an equivalence relationship; positive queries such as those seen in the above example (e.g. the disjoint partition of PhD and masters students) cannot then be expressed⁷.

Another problem is related to propagating updates that are stated in terms of the logical schema into the appropriate storage structures. In the classical approach to maintaining physical data independence, where physical storage structures are defined as views over the logical schema, updating these structures is simple, as it reduces to refreshing materialized views. Under our problem definition, however, we face a generalized version of the much more involved view update problem [BS81, FC85, AHV95].

⁶ Certainly, design flaws in the physical storage structures – flaws that do not permit inserting data or answering certain queries although this should be possible from the point of view of the logical schema – are harder to debug if maximally contained rewritings still return nonempty results in cases where no equivalent rewritings exist.
⁷ This would require a major change of framework.


Finally, an issue that we have left out of consideration is that it may be useful to have storage structures defined using binding patterns (that are, however, weak in the sense that if no rewriting can be found that obeys them, the best rewriting – according to some cost metric – that can be found should be chosen). That way, indexes are special cases of such storage structures in which the index keys are defined as bound [TSI94].

An interesting technique for obtaining equivalent rewritings with cind’s has

not been discussed so far. It is based on the idea of reversing the process of

computing the rewritings. In the method for computing equivalent rewritings

proposed in Chapter 5, one first attempts to obtain a contained rewriting and

then to prove it equivalent. Alternatively, one could try to obtain a subsuming

rewriting first and subsequently prove it to be contained in the input query.

This is done as follows. Let Q be the conjunctive input query and C the set of Horn clauses obtained by normalizing the cind's. First, Q is frozen into a canonical database I in the tradition of Example 3.6.1. Next, the consequences of the logic program I ∪ C (where I is taken as a set of facts) are determined by bottom-up computation. If this computation reaches a fixpoint and an equivalent rewriting exists, then such a rewriting is among the queries that can be constructed, by undoing the freezing process⁸, from the frozen head of Q and subsets of those facts over source predicates that are in the fixpoint of the bottom-up computation. Whether a candidate is indeed an equivalent rewriting can be determined by another bottom-up derivation (this time in the “opposite” direction), as described in Example 5.3.1.

Example 7.1.3 Consider the query q(x, y) ← a(x, y). and the cind

{⟨x, y⟩ | a(x, y)} ≡ {⟨x, z⟩ | ∃y : b(x, y), c(y, z)}

and the source schema S = {b, c}. We freeze q into the facts base {a(αx, αy)} and combine it with the three Horn clauses that result from the normalization of the above cind. Bottom-up derivation results in the fixpoint

{a(αx, αy), b(αx, f(αx, αy)), c(f(αx, αy), αy)}

Only one query satisfying the safety requirement can be constructed from the head of q and a subset of the fixpoint over predicates in S, namely

q′(x, y) ← b(x, z), c(z, y).

(z is the variable that replaces the function term f(αx, αy).) Thus, q′ ⊇ q. By freezing q′, combining the canonical database obtained with our Horn clauses, and refuting the body of q bottom-up, we discover that q′ ⊆ q. Thus, q′ is an equivalent rewriting of q.

⁸ That is, variables frozen into constants are again replaced by new variables, and so are function terms.

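The freeze-and-saturate step of Example 7.1.3 can be made executable. The encoding below (frozen constants as strings, the Skolem function f as a tagged tuple) is my own representation choice, not the thesis':

```python
# Sketch of Example 7.1.3: freeze q(x,y) <- a(x,y) into a(αx, αy), then
# saturate with the Horn clauses obtained from normalizing the cind
#   {<x,y> | a(x,y)} ≡ {<x,z> | ∃y: b(x,y), c(y,z)}.
facts = {("a", "αx", "αy")}  # the frozen query body

def saturate(facts):
    changed = True
    while changed:
        changed = False
        new = set()
        # Clauses b(x, f(x,z)) <- a(x,z) and c(f(x,z), z) <- a(x,z):
        for fact in list(facts):
            if fact[0] == "a":
                _, x, z = fact
                skolem = ("f", x, z)   # Skolem term f(x, z)
                new.add(("b", x, skolem))
                new.add(("c", skolem, z))
        # Clause a(x,z) <- b(x,y), c(y,z):
        for (p1, x, y) in list(facts):
            if p1 == "b":
                for (p2, y2, z) in list(facts):
                    if p2 == "c" and y2 == y:
                        new.add(("a", x, z))
        if not new <= facts:
            facts |= new
            changed = True
    return facts

fix = saturate(set(facts))
source_facts = sorted(f for f in fix if f[0] in {"b", "c"})
print(source_facts)  # the body of the subsuming rewriting q'(x,y) <- b(x,z), c(z,y)
```

Unfreezing the two source facts (replacing αx, αy, and the Skolem term by fresh variables) yields exactly q′(x, y) ← b(x, z), c(z, y) from the example.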

If we can guarantee, for a restricted class of queries and cind's, that fixpoints are always reached in bottom-up derivations, we have a complete algorithm for computing equivalent rewritings that is guaranteed to terminate. One such class is obtained by requiring all queries (both input queries and the subsumer and subsumed queries in cind's) to be typed conjunctive queries⁹ (see e.g. [AHV95]). For constants appearing in queries one has to require an analogous typedness property. For instance, note that in the boolean query q ← a(1, 1). the two constants must be assumed to be from different domains and thus different (as their attributes are of different types).

Furthermore, one has to require that attributes added or removed between two consecutive schema versions be consistently existentially quantified throughout all of their appearances in cind's between these two schema versions. This requirement is in general too restrictive for data integration, but mirrors quite closely the natural semantics of schema evolution.

7.2 Rewriting Recursive Queries

We have shown earlier that query rewriting with cyclic sets of cind's is undecidable. The case of finding a maximally contained rewriting of a recursive (datalog) query with respect to an acyclic set of cind's, on the other hand, can be solved in a straightforward way. The result is again a recursive datalog program.

We use the technique from [DG97], originally defined for the problem of answering recursive queries using views, in a minor generalization (we work with an acyclic set of cind's rather than a single flat “layer” of views). We use the fact that for acyclic sets of cind's, function terms cannot grow beyond a certain finite depth during bottom-up derivation starting from the database. This depth is bounded by the total number of function symbols available. There is a unique finite set of all those Horn clauses whose head predicates appear in the recursive query to be rewritten and whose subgoals are all materialized “source” predicates for which data are available¹⁰.

Let us, however, first take the perspective of query answering by bottom-up derivation, considering the combination of a set of (acyclic) cind's and a recursive query as a logic program. Clearly, large intermediate results (constructed using function terms) are created, which we want to avoid for efficiency reasons.

⁹ Typed conjunctive queries follow the named perspective of relational algebra, i.e., each attribute of a relation has a name that is unique inside the relation. Typed conjunctive queries are only allowed to contain equijoins, i.e., joins between relations by attributes with the same name.
¹⁰ This set is computed by Algorithm 5.3.3 if we omit the part that tries to rewrite the input query with the unfolded Horn clauses that have been computed.

Example 7.2.1 Let there be the recursive query

Figure 7.4: Fixpoint of the bottom-up derivation of Example 7.2.1.

q(x, y) ← e(x, y).        q(x, z) ← e(x, y), q(y, z).

which computes the transitive closure of the graph

⟨V = {v1 | ∃v2 : e(v1, v2)} ∪ {v2 | ∃v1 : e(v1, v2)}, E = e⟩

and the cind's

Σ = { {⟨x, z⟩ | ∃y : e(x, y) ∧ e(y, z)} ⊇ {⟨x, z⟩ | t(x, z)},
      {⟨u, w⟩ | ∃v : t(u, v) ∧ t(v, w)} ⊇ {⟨u, w⟩ | s(u, w)} }

where t logically represents chains of two edges and s is a source of chains of four edges. Assume now that we have the database I = {s(α, β), s(β, γ)}, where α, β, γ are constants, the nodes of our graph. By transforming Σ into normal form and performing bottom-up derivation, we obtain the fixpoint shown as a directed graph in Figure 7.4. There is a tuple in q for each arc in the graph¹¹. Those arcs that belong only to q are drawn as dotted lines. The result of the query is the set of arcs between non-function-term nodes, i.e. {⟨α, β⟩, ⟨β, γ⟩, ⟨α, γ⟩}.
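This bottom-up derivation can be replayed in code. The sketch below uses my own encoding (constants as strings, Skolem terms fv and fy as tagged tuples) to expand the s-facts through the normalized cind's and then compute the transitive closure:

```python
# Executable reading of Example 7.2.1: expand s-facts into e-edges with
# Skolem terms, take the transitive closure, keep only constant pairs.
s = {("α", "β"), ("β", "γ")}
consts = {"α", "β", "γ"}

# {<u,w> | ∃v: t(u,v) ∧ t(v,w)} ⊇ {<u,w> | s(u,w)}, Skolemized with fv:
t = set()
for (u, w) in s:
    v = ("fv", u, w)
    t |= {(u, v), (v, w)}

# {<x,z> | ∃y: e(x,y) ∧ e(y,z)} ⊇ {<x,z> | t(x,z)}, Skolemized with fy:
e = set()
for (x, z) in t:
    y = ("fy", x, z)
    e |= {(x, y), (y, z)}

# q: naive bottom-up transitive closure of e.
q = set(e)
while True:
    new = {(x, z) for (x, y) in e for (y2, z) in q if y2 == y} - q
    if not new:
        break
    q |= new

answer = sorted(p for p in q if p[0] in consts and p[1] in consts)
print(answer)
```

The eight e-edges form a single chain of length eight from α through β to γ, so the constant-to-constant arcs in the closure are exactly the three tuples of the example.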

It is possible to rewrite the cind's and the query into a single datalog query such that no function terms have to be introduced during query execution. This method is a straightforward generalization of the algorithm in [DG97] to Horn clauses that are the unfoldings of the normalized acyclic cind's, using Algorithm 5.3.3.

Example 7.2.2 Consider again q and Σ of the previous example. The unfolding of the normal form of Σ relative to e, the only EDB predicate of the query, is

e(x, fy(x, fv(x, y))) ← s(x, y).        e(fy(x, fv(x, y)), fv(x, y)) ← s(x, y).
e(fv(x, y), fy(fv(x, y), y)) ← s(x, y).        e(fy(fv(x, y), y), y) ← s(x, y).

We transform these into

e⟨1,fy(2,fv(3,4))⟩(x, x, x, y) ← s(x, y).
e⟨fy(1,fv(2,3)),fv(4,5)⟩(x, x, y, x, y) ← s(x, y).
e⟨fv(1,2),fy(fv(3,4),5)⟩(x, y, x, y, y) ← s(x, y).
e⟨fy(fv(1,2),3),4⟩(x, y, y, y) ← s(x, y).

where the structure of the function terms produced is moved into the predicate names (e.g. e⟨1,fy(2,fv(3,4))⟩); the integers denote the index of the variable or constant in the head atom that corresponds to that position in the function term. The query is now transformed bottom-up, across possibly several iterations. The result of the first iteration is

¹¹ To keep the figure from being overloaded, the “q” arcs are not named, unlike the other arcs.


q⟨1,fy(2,fv(3,4))⟩(x1, x2, x3, x4) ← e⟨1,fy(2,fv(3,4))⟩(x1, x2, x3, x4).
q⟨fy(1,fv(2,3)),fv(4,5)⟩(x1, x2, x3, x4, x5) ← e⟨fy(1,fv(2,3)),fv(4,5)⟩(x1, x2, x3, x4, x5).
q⟨fv(1,2),fy(fv(3,4),5)⟩(x1, x2, x3, x4, x5) ← e⟨fv(1,2),fy(fv(3,4),5)⟩(x1, x2, x3, x4, x5).
q⟨fy(fv(1,2),3),4⟩(x1, x2, x3, x4) ← e⟨fy(fv(1,2),3),4⟩(x1, x2, x3, x4).

for the first rule of q and

q⟨fy(fv(1,2),3),fy(4,fv(5,6))⟩(x1, x2, x3, x5, x6, x7) ←
    e⟨fy(fv(1,2),3),4⟩(x1, x2, x3, x4),
    q⟨1,fy(2,fv(3,4))⟩(x4, x5, x6, x7).
q⟨1,fv(2,3)⟩(x1, x5, x6) ←
    e⟨1,fy(2,fv(3,4))⟩(x1, x2, x3, x4),
    q⟨fy(1,fv(2,3)),fv(4,5)⟩(x2, x3, x4, x5, x6).
q⟨fy(1,fv(2,3)),fy(fv(4,5),6)⟩(x1, x2, x3, x6, x7, x8) ←
    e⟨fy(1,fv(2,3)),fv(4,5)⟩(x1, x2, x3, x4, x5),
    q⟨fv(1,2),fy(fv(3,4),5)⟩(x4, x5, x6, x7, x8).
q⟨fv(1,2),3⟩(x1, x2, x6) ←
    e⟨fv(1,2),fy(fv(3,4),5)⟩(x1, x2, x3, x4, x5),
    q⟨fy(fv(1,2),3),4⟩(x3, x4, x5, x6).

for the second rule. The latter four rules combine the four function-free rewritings of the unfolded Horn clauses with the rewritings of the first rule of q. In the subsequent iterations, the results of the previous iterations are combined. It would consume too much space to write down the full rewriting, which contains 8 more rules for q. A single one of them is the new query goal,

q⟨1,2⟩(x1, x5) ← e⟨1,fy(2,fv(3,4))⟩(x1, x2, x3, x4), q⟨fy(1,fv(2,3)),4⟩(x2, x3, x4, x5).

Clearly, a number of optimizations over this naive transformation are possible¹², for which we refer to [DG97].

This transformation can easily be automated and is applicable to all datalog queries.

¹² After all, this query is equivalent to {q(x, y) ← s(x, y). q(x, z) ← s(x, y), q(y, z).} Note that only four of the q-predicates (q⟨1,2⟩, q⟨fy(1,fv(2,3)),4⟩, q⟨fv(1,2),3⟩, q⟨fy(fv(1,2),3),4⟩) created using this naive transformation are – taking a top-down perspective – reachable from the goal predicate q⟨1,2⟩, and rules containing the others may be eliminated outright.


Chapter 8

Conclusions

The approach to data integration that has been proposed in this thesis has the following features:

• The infrastructure does not rely on a “global” integration schema as under LAV. Rather, each of several information systems may need access to integrated data from other information systems. Integration schemata may lack sophistication or even any special preparation for source integration.

• Integration schemata may contain both materialized database relations and purely logical predicates, for which data have to be provided by means of data integration.

• Our approach provides good support for the creation and maintenance of mappings between information systems under frequent change. This includes good decoupling of information systems through the mappings used for integration, such that the workload imposed on the knowledge engineer who maintains mappings when change occurs is as small as possible. At the same time, the approach permits mappings to be designed in a reasonably natural way, thus simplifying the modeling and mapping work and enabling the designer to express intuitions that may be useful for anticipating future change.

• The data integration reasoning is carried out globally, declaratively, and uses an intuitive and accessible semantics. Mappings between several information systems are transitive, which reduces the amount of redundant mapping work that has to be done.

• Conjunctive inclusion dependencies as inter-schema constraints allow us to deal with concept mismatch in a wide sense. This is a necessary condition for being able to deal with autonomous and changing integration schemata.
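A conjunctive inclusion dependency states that one conjunctive query (here over a source schema) is contained in another (over a target schema). As an informal illustration — the relations and names below are invented for this sketch, not taken from the thesis — such a dependency can be checked on concrete database instances by evaluating both sides and testing set inclusion:

```python
# Invented example of a conjunctive inclusion dependency:
#   detector(d, e), part_of(e, x)  ⊆  experiment_component(d, x)
# The left-hand side is a conjunctive query over the source schema;
# the dependency holds on an instance when its answers are a subset
# of the target relation.

detector = {("tracker", "cms")}
part_of = {("cms", "lhc")}
experiment_component = {("tracker", "lhc"), ("magnet", "lhc")}

# evaluate the left-hand conjunctive query (join on the shared variable e)
lhs = {(d, x) for (d, e) in detector for (e2, x) in part_of if e == e2}

print(lhs <= experiment_component)   # True: the dependency holds here
```

In the thesis's setting, such dependencies are used as constraints for query rewriting rather than checked on instances, but the instance-level reading above conveys their semantics.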


We have pointed out that data integration with multiple unsophisticated evolving integration schemata is a problem of high relevance¹ that has been insufficiently addressed so far. None of the previous work seems to be directly suitable. Apart from management problems with respect to schemata and mappings similar to those known from federated and multidatabases, we are confronted with kinds of schema mismatch that require very expressive inter-schema constraints.

We have presented an approach based on model management and query rewriting with expressive constraints and have discussed an architecture (Chapter 4), model management operations (Chapter 6), and the issue of query rewriting (Chapter 5), a problem at the core of data integration. We have argued that our approach supports the maintenance of the integration infrastructure by allowing the modeling of mappings in a natural way and the decoupling of schemata and mappings such that maintenance under change is simplified.

The practical feasibility of our approach has been shown in part by the implementation of the CindRew system based on the results of this thesis, and by the benchmarks of Section 5.5. For the other part – model management – our presentation was based on elementary intuitions of managing large systems that have been widely verified and have permeated mainstream computer science thinking.

Much recent work in data integration has focused either on procedural or on highly structured declarative approaches meant to combine sufficient expressive power with decidability (which we cannot guarantee for our approach in its most general form). We have taken another direction, encoding a highly intuitive class of constraints² and providing theoretical results and an implementation for sound best-effort query rewriting, with the intuition that practical data integration problems will often be solved completely. We have also discussed a very important class (acyclic sets of constraints) for which we can guarantee completeness. We believe this work may be of quite immediate practical usefulness.
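The constraint class just mentioned is built from conjunctive queries (see footnote 2), which have a direct select-from-where reading. As an informal illustration — the relations and data below are invented, not taken from the thesis — a conjunctive query and its SQL-style counterpart can be sketched as:

```python
# Invented example. The conjunctive query
#   ans(x, z) <- works_in(x, y), located_in(y, z)
# reads as the select-from-where query
#   SELECT w.person, l.city FROM works_in w, located_in l
#   WHERE  w.dept = l.dept

works_in = {("alice", "cms"), ("bob", "atlas")}
located_in = {("cms", "geneva"), ("atlas", "meyrin")}

# the join variable y is the department shared by both atoms
ans = {(x, z) for (x, y) in works_in
              for (y2, z) in located_in if y == y2}
print(sorted(ans))   # [('alice', 'geneva'), ('bob', 'meyrin')]
```

This one-to-one correspondence between rule bodies and select-from-where blocks is what makes the constraint language accessible to non-expert users.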

Plenty of material for further research has been provided in Chapter 7. A successor project to the research that led to this thesis could be an effort to develop an integrated model management and query rewriting system based on the results presented here, but built on an object-oriented data model. Such a system could be of immediate usefulness to scientific communities such as that of high-energy physics. Our query rewriting approach has an acceptability advantage compared to other data integration approaches applicable to the setting of large scientific collaborations (see Section 1.3). This is particularly true when it comes to data integration on the Grid [FK98], with its extremely large data volumes. In this light, we deem this work also a practical success, with a clear benefit to the host of this PhD program, CERN.

¹ The relevance of this work has been sufficiently argued for in Section 1.3 and Section 1.5, and we will not reiterate it here.

² We use conjunctive queries both in constraints and targets for rewriting. When put into a syntax such as select-from-where queries or tableau queries [Ull88, Ull89, AHV95], conjunctive queries can be mastered by many non-expert users.

Bibliography

[AAA+97] J.L. Ambite, Y. Arens, N. Ashish, C.A. Knoblock, S. Minton, J. Modi, M. Muslea, A. Philpot, W. Shen, S. Tejada, and W. Zhang. “The SIMS Manual: Version 2.0. Working Draft”, December 1997.

[AB88] Serge Abiteboul and Catriel Beeri. “On the Power of Languages for the Manipulation of Complex Objects”. Technical Report TR 846, INRIA, 1988.

[ABD+96] Daniel E. Atkins, William P. Birmingham, Edmund H. Durfee, Eric J. Glover, Tracy Mullen, Elke A. Rundensteiner, Elliot Soloway, José M. Vidal, Raven Wallace, and Michael P. Wellman. “Toward Inquiry-Based Education Through Interacting Software Agents”. IEEE Computer, 29(5):69–76, May 1996.

[Abi97] Serge Abiteboul. “Querying Semistructured Data”. In Proc. ICDT’97, Delphi, Greece, 1997.

[ABS00] Serge Abiteboul, Peter Buneman, and Dan Suciu. Data on the Web. Morgan Kaufmann Publishers, 2000.

[ABU79] Alfred V. Aho, Catriel Beeri, and Jeffrey D. Ullman. “The Theory of Joins in Relational Databases”. ACM Transactions on Database Systems, 4(3):297–314, 1979.

[ACPS96] S. Adali, K. S. Candan, Y. Papakonstantinou, and V. S. Subrahmanian. “Query Caching and Optimization in Distributed Mediator Systems”. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data (SIGMOD’96), pages 137–146, Montreal, Canada, June 1996.

[AD98] Serge Abiteboul and Oliver M. Duschka. “Complexity of Answering Queries Using Materialized Views”. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS) 1998, pages 254–263, 1998.

[Age] UMBC Agents Mailing List Archive. http://agents.umbc.edu/agentslist/archive/.

[AHV95] Serge Abiteboul, Richard Hull, and Victor Vianu. Foundations of Databases. Addison-Wesley, 1995.

[AK92] Yigal Arens and Craig A. Knoblock. “Planning and Reformulating Queries for Semantically-Modeled Multidatabase Systems”. In Proceedings of the First International Conference on Information and Knowledge Management (CIKM’92), Baltimore, MD, 1992.

[AQM+97] Serge Abiteboul, Dallan Quass, Jason McHugh, Jennifer Widom, and Janet L. Wiener. “The Lorel Query Language for Semistructured Data”. International Journal on Digital Libraries, 1(1):68–88, 1997.

[AS99] Albert Alderson and Hanifa Shah. “Viewpoints on Legacy Systems”. Communications of the ACM, 42(3):115–116, 1999.

[AV97a] Serge Abiteboul and Victor Vianu. “Queries and Computation on the Web”. In Proc. ICDT’97, 1997.

[AV97b] Serge Abiteboul and Victor Vianu. “Regular Path Queries with Constraints”. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, May 11–15, 1997, Tucson, AZ USA, 1997.

[BB99] Philip A. Bernstein and Thomas Bergstraesser. “Meta-Data Support for Data Transformations Using Microsoft Repository”. IEEE Data Engineering Bulletin, 22(1):9–14, March 1999.

[BBB+97] Roberto J. Bayardo Jr., William Bohrer, Richard S. Brice, Andrzej Cichocki, Jerry Fowler, Abdelsalam Helal, Vipul Kashyap, Tomasz Ksiezyk, Gale Martin, Marian H. Nodine, Mosfeq Rashid, Marek Rusinkiewicz, Ray Shea, C. Unnikrishnan, Amy Unruh, and Darrell Woelk. “InfoSleuth: Agent-Based Semantic Integration of Information in Open and Dynamic Environments”. In J. Peckham, editor, Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data (SIGMOD’97), pages 195–206, Tucson, Arizona, USA, May 1997. ACM Press.

[BBMR89a] Alexander Borgida, Ronald J. Brachman, Deborah L. McGuinness, and Lori A. Resnick. “CLASSIC: A Structural Data Model for Objects”. In Proceedings of the 1989 ACM SIGMOD International Conference on Management of Data (SIGMOD’89), pages 59–67, June 1989.

[BBMR89b] Ronald J. Brachman, Alex Borgida, Deborah L. McGuinness, and Lori A. Resnick. “The CLASSIC Knowledge Representation System, or, KL-ONE: The Next Generation”, February 1989.

[BD99] Alex Borgida and Prem Devanbu. “Adding more DL to IDL: Towards more Knowledgeable Component Inter-operability”. In Proc. of ICSE’99, 1999.

[BDBW97] J. M. Bradshaw, S. Dutfield, P. Benoit, and J.D. Woolley. “KAoS: Toward an Industrial-strength Open Agent Architecture”. In Jeffrey M. Bradshaw, editor, Software Agents, pages 375–418. AAAI/MIT Press, 1997.

[BDHS96] Peter Buneman, Susan Davidson, Gerd Hillebrand, and Dan Suciu. “A Query Language and Optimization Techniques for Unstructured Data”. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data (SIGMOD’96), 1996.

[BF97] Avrim L. Blum and Merrick L. Furst. “Fast Planning Through Planning Graph Analysis”. Artificial Intelligence, 90:281–300, 1997.

[BH91] Franz Baader and Bernhard Hollunder. “KRIS: Knowledge Representation and Inference System”. SIGART Bulletin, 2(3):8–14, 1991.

[BLN86] Carlo Batini, Maurizio Lenzerini, and Shamkant B. Navathe. “A Comparative Analysis of Methodologies for Database Schema Integration”. ACM Computing Surveys, 18:323–364, 1986.

[BLP00] Philip A. Bernstein, Alon Y. Levy, and Rachel A. Pottinger. “A Vision for Management of Complex Models”. Technical Report 2000-53, Microsoft Research, 2000.

[BLR97] Catriel Beeri, Alon Y. Levy, and Marie-Christine Rousset. “Rewriting Queries Using Views in Description Logics”. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, May 11–15, 1997, Tucson, AZ USA, pages 99–108, 1997.

[BM93] Elisa Bertino and Lorenzo Martino. Object-oriented Database Systems – Concepts and Architectures. Addison-Wesley, 1993.

[Bor95] Alexander Borgida. “Description Logics in Data Management”. IEEE Transactions on Knowledge and Data Engineering, 7(5):671–682, October 1995.

[BPGL85] Ronald J. Brachman, V. Pigman Gilbert, and Hector J. Levesque. “An Essential Hybrid Reasoning System: Knowledge and Symbol Level Accounts in KRYPTON”. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI’85), pages 532–539, 1985.

[BPS94] Alexander Borgida and Peter F. Patel-Schneider. “A Semantics and Complete Algorithm for Subsumption in the CLASSIC Description Logic”. Journal of Artificial Intelligence Research, 1:277–308, 1994.

[Bra83] Ronald J. Brachman. “What IS-A is and isn’t: An Analysis of Taxonomic Links in Semantic Networks”. IEEE Computer, 16(10), October 1983.

[BS81] F. Bancilhon and N. Spyratos. “Update Semantics of Relational Views”. ACM Transactions on Database Systems, 6(4):557–575, December 1981.

[BS85] Ronald J. Brachman and James G. Schmolze. “An Overview of the KL-ONE Knowledge Representation System”. Cognitive Science, 9(2):171–216, 1985.

[CBB+97] R. G. G. Cattell, Douglas K. Barry, Mark Berler, Jeff Eastman, David Jordan, Craig Russell, Olaf Schadow, Torsten Stanienda, and Fernando Velez. The Object Database Standard: ODMG 2.0. Morgan Kaufmann, 1997.

[CDL98a] Diego Calvanese, Giuseppe De Giacomo, and Maurizio Lenzerini. “On the Decidability of Query Containment under Constraints”. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS) 1998, pages 149–158, 1998.

[CDL+98b] Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, Daniele Nardi, and Riccardo Rosati. “Information Integration: Conceptual Modeling and Reasoning Support”. In Proc. CoopIS’98, pages 280–291, 1998.

[CDL99] Diego Calvanese, Giuseppe De Giacomo, and Maurizio Lenzerini. “Answering Queries using Views in Description Logics”. In Proc. of the 1999 Description Logic Workshop (DL’99), CEUR Workshop Proceedings, Vol. 22, pages 9–13, 1999.

[CDLV99] Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and Moshe Y. Vardi. “Rewriting of Regular Expressions and Regular Path Queries”. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS) 1999, pages 194–204, Philadelphia, PA, 1999.

[CDLV00a] Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and Moshe Y. Vardi. “Answering Regular Path Queries Using Views”. In Proceedings of the IEEE International Conference on Data Engineering (ICDE 2000), pages 389–398, 2000.

[CDLV00b] Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and Moshe Y. Vardi. “View-Based Query Processing for Regular Path Queries with Inverse”. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS) 2000, pages 58–66, Dallas, TX, 2000.

[CH80] Ashok K. Chandra and David Harel. “Computable Queries for Relational Data Bases”. Journal of Computer and System Sciences, 21(2):156–178, 1980.

[CH82] Ashok K. Chandra and David Harel. “Structure and Complexity of Relational Queries”. Journal of Computer and System Sciences, 25(1):99–128, 1982.

[Cha88] Ashok K. Chandra. “Theory of Database Queries”. In Proceedings of the 7th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS’88), pages 1–9. ACM Press, 1988.

[Che76] Peter Pin-Shan Chen. “The Entity-Relationship Model – Toward a Unified View of Data”. ACM Transactions on Database Systems, 1(1):9–36, March 1976.

[CHS+95] Michael J. Carey, Laura M. Haas, Peter M. Schwarz, Manish Arya, William F. Cody, Ronald Fagin, Myron Flickner, Allen W. Luniewski, Wayne Niblack, Dragutin Petkovic, John Thomas, John H. Williams, and Edward L. Wimmers. “Towards Heterogeneous Multimedia Information Systems: The Garlic Approach”. In Proceedings of the Fifth International Workshop on Research Issues in Data Engineering: Distributed Object Management (RIDE-DOM’95), 1995.

[CJ96] D. Cockburn and N. R. Jennings. “ARCHON: A Distributed Artificial Intelligence System for Industrial Applications”. In G. M. P. O’Hare and N. R. Jennings, editors, Foundations of Distributed Artificial Intelligence, pages 319–344. Wiley, 1996.

[CKM91] Jaime G. Carbonell, Craig A. Knoblock, and Steven Minton. “PRODIGY: An Integrated Architecture for Planning and Learning”. In Kurt VanLehn, editor, Architectures for Intelligence, pages 241–278. Lawrence Erlbaum, Hillsdale, NJ, 1991.

[CKPS95] S. Chaudhuri, R. Krishnamurthy, S. Potamianos, and K. Shim. “Optimizing Queries with Materialized Views”. In Proceedings of the 11th IEEE International Conference on Data Engineering (ICDE’95), 1995.

[CKW89] Weidong Chen, Michael Kifer, and David S. Warren. “HiLog: A Foundation for Higher-Order Logic Programming”. Technical report, Dept. of CS, SUNY at Stony Brook, 1989.

[CM77] Ashok K. Chandra and Philip M. Merlin. “Optimal Implementation of Conjunctive Queries in Relational Data Bases”. In Conference Record of the Ninth Annual ACM Symposium on Theory of Computing (STOC’77), pages 77–90, Boulder, Colorado, May 1977.

[CM90] Mariano P. Consens and Alberto O. Mendelzon. “GraphLog: a Visual Formalism for Real Life Recursion”. In Proceedings of the 9th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS’90), 1990.

[CMS95] “CMS Technical Proposal”, January 1995.

[Cod70] E. F. Codd. “A Relational Model of Data for Large Shared Data Banks”. Communications of the ACM, 13(6):377–387, June 1970.

[Coo] International Conferences on Cooperative Information Systems, 1996–2001.

[COZ00] P. Ciancarini, A. Omicini, and F. Zambonelli. “Multiagent System Engineering: the Coordination Viewpoint”. In Intelligent Agents VI – Proceedings of the 6th International Workshop on Agent Theories, Architectures, and Languages (ATAL’99), LNAI Series, Vol. 1767. Springer Verlag, February 2000.

[Cro94] Kevin Crowston. “A Taxonomy Of Organizational Dependencies and Coordination Mechanisms”. Technical Report 174, MIT Centre for Coordination Science, Cambridge, MA, 1994.

[CS93] Surajit Chaudhuri and Kyuseok Shim. “Query Optimization in the Presence of Foreign Functions”. In Proceedings of the 19th International Conference on Very Large Data Bases (VLDB’93), Dublin, Ireland, 1993.

[CTP00] Peter Clark, J. Thompson, and Bruce Porter. “Knowledge Patterns”. In Proceedings of the International Conference on Principles of Knowledge Representation and Reasoning (KR’2000), 2000.

[CV92] Surajit Chaudhuri and Moshe Y. Vardi. “On the Equivalence of Recursive and Nonrecursive Datalog Programs”. In Proceedings of the 11th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS’92), pages 55–66, 1992.

[CV94] Surajit Chaudhuri and Moshe Y. Vardi. “On the Complexity of Equivalence between Recursive and Nonrecursive Datalog Programs”. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS) 1994, pages 107–116, Minneapolis, Minnesota, May 1994.

[CV97] Surajit Chaudhuri and Moshe Y. Vardi. “On the Equivalence of Recursive and Nonrecursive Datalog Programs”. Journal of Computer and System Sciences, 54(1):61–78, 1997.

[Cyc] Cycorp. “Features of CycL”. http://www.cyc.com/cycl.html.

[Dec95] Keith Decker. “TAEMS: A Framework for Environment Centered Analysis and Design of Coordination Mechanisms”. In G. O’Hare and Nicholas Jennings, editors, Foundations of Distributed Artificial Intelligence, chapter 16, pages 429–448. Wiley Inter-Science, 1995.

[DEGV] Evgeny Dantsin, Thomas Eiter, Georg Gottlob, and Andrei Voronkov. “Complexity and Expressive Power of Logic Programming”. To appear in ACM Computing Surveys.

[DG97] Oliver M. Duschka and Michael R. Genesereth. “Answering Recursive Queries using Views”. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, May 11–15, 1997, Tucson, AZ USA, Tucson, Arizona, 1997.

[DGL00] Oliver M. Duschka, Michael R. Genesereth, and Alon Y. Levy. “Recursive Query Plans for Data Integration”. Journal of Logic Programming, 43(1):49–73, 2000.

[DJ90] Nachum Dershowitz and Jean-Pierre Jouannaud. “Rewrite Systems”. In Jan van Leeuwen, editor, Handbook of Theoretical Computer Science, volume 2, chapter 6, pages 243–320. Elsevier Science Publishers B.V., 1990.

[DL91] Edmund H. Durfee and Victor R. Lesser. “Partial Global Planning: A Coordination Framework for Distributed Hypothesis Formation”. IEEE Transactions on Systems, Man, and Cybernetics (Special Issue on Distributed Sensor Networks), 21(5):1167–1183, September 1991.

[DL92] Keith Decker and Victor Lesser. “Generalizing The Partial Global Planning Algorithm”. International Journal on Intelligent Cooperative Information Systems, 1(2):319–346, 1992.

[DL95] Keith Decker and Victor Lesser. “Designing a Family of Coordination Algorithms”. In Proceedings of the First International Conference on Multiagent Systems (ICMAS’95), San Francisco, June 1995. AAAI Press.

[DL97a] Giuseppe De Giacomo and Maurizio Lenzerini. “A Uniform Framework for Concept Definitions in Description Logics”. Journal of Artificial Intelligence Research (JAIR), 6:87–110, 1997.

[DL97b] Oliver M. Duschka and Alon Y. Levy. “Recursive Plans for Information Gathering”. In Proceedings of the 15th International Joint Conference on Artificial Intelligence (IJCAI’97), Nagoya, Japan, August 1997.

[DLNS96] Francesco Donini, Maurizio Lenzerini, Daniele Nardi, and Andrea Schaerf. “Reasoning in Description Logics”. In G. Brewka, editor, Principles of Knowledge Representation and Reasoning, Studies in Logic, Language and Information, pages 193–238. CLSI Publications, 1996.

[DLNS98] Francesco M. Donini, Maurizio Lenzerini, Daniele Nardi, and Andrea Schaerf. “AL-log: Integrating Datalog and Description Logics”. Journal of Intelligent Information Systems, 10:227–252, 1998.

[DS83] R. Davis and R. G. Smith. “Negotiation as a Metaphor for Distributed Problem Solving”. Artificial Intelligence, 20(1):63–109, January 1983.

[DSW97] Keith Decker, Katia Sycara, and Mike Williamson. “Middle-Agents for the Internet”. In Proceedings of the 15th International Joint Conference on Artificial Intelligence (IJCAI’97), Nagoya, Japan, 1997.

[DSW+99] A. J. Duineveld, R. Stoter, M. R. Weiden, B. Kenepa, and V. R. Benjamins. “Wondertools? A Comparative Study of Ontological Engineering Tools”. In Proc. Twelfth Workshop on Knowledge Acquisition, Modeling and Management (KAW’99), Banff, Alberta, Canada, October 1999.

[DV97] Evgeny Dantsin and Andrei Voronkov. “Complexity of Query Answering in Logic Databases with Complex Values”. In LFCS’97, LNCS 1234, pages 56–66, 1997.

[Etz96] Oren Etzioni. “Moving Up the Information Food Chain: Deploying Softbots on the World Wide Web”. In Proc. AAAI’96, 1996.

[FC85] A.L. Furtado and M.A. Casanova. “Updating Relational Views”. In W. Kim, D.S. Reiner, and D.S. Batory, editors, Query Processing in Database Systems. Springer-Verlag, Berlin, 1985.

[FFKL98] Mary Fernandez, Daniela Florescu, Jaewoo Kang, and Alon Levy. “Catching the Boat with Strudel: Experiences with a Web-Site Management System”. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data (SIGMOD’98), pages 414–425, Seattle, WA, June 1998.

[FFMM94] T. Finin, R. Fritzson, D. McKay, and R. McEntire. “KQML as an Agent Communication Language”. In Proceedings of the Third International Conference on Information and Knowledge Management (CIKM’94). ACM Press, November 1994.

[FFR96] A. Farquhar, R. Fikes, and J. Rice. “The Ontolingua Server: a Tool for Collaborative Ontology Construction”. In B. Gaines, editor, Proceedings of 10th Knowledge Acquisition for Knowledge-Based Systems Workshop (KAW96), Banff, Canada, 1996.

[FK98] Ian Foster and Carl Kesselman, editors. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publishers, San Francisco, July 1998.

[FL97] Tim Finin and Yannis Labrou. “A Proposal for a new KQML Specification”. Technical Report CS-97-03, Computer Science and Electrical Engineering Department, University of Maryland Baltimore County, Baltimore, MD 21250, February 1997.

[FLM98] Daniela Florescu, Alon Levy, and Alberto Mendelzon. “Database Techniques for the World-Wide Web: A Survey”. SIGMOD Record, 27(3):59–74, 1998.

[FMU82] Ronald Fagin, Alberto O. Mendelzon, and Jeffrey D. Ullman. “A Simplified Universal Relation Assumption and its Properties”. ACM Transactions on Database Systems, 7(3):343–360, 1982.

[FN71] Richard Fikes and Nils J. Nilsson. “STRIPS: A new Approach to the Application of Theorem Proving to Problem Solving”. Artificial Intelligence, 2(3/4), 1971.

[FN00] Enrico Franconi and Gary Ng. “The ICOM Tool for Intelligent Conceptual Modelling”. In Proc. 7th Intl. Workshop on Knowledge Representation meets Databases (KRDB’00), Berlin, Germany, August 2000.

[FNPB99] Jerry Fowler, Marian Nodine, Brad Perry, and Bruce Bargmeyer. “Agent-based Semantic Interoperability in InfoSleuth”. SIGMOD Record, 28(1):60–67, 1999.

[Fra99] Enrico Franconi, 1999. Description Logics Course Web Page. Available at http://www.cs.man.ac.uk/~franconi/dl/course/.

[FRV95] Daniela Florescu, Louiqa Raschid, and Patrick Valduriez. “Using Heterogeneous Equivalences for Query Rewriting in Multidatabase Systems”. In Proc. CoopIS’95, pages 158–169, 1995.

[FVR96] Daniela Florescu, Patrick Valduriez, and Louiqa Raschid. “Answering Queries Using OQL View Expressions”. In Workshop on Materialized Views in Cooperation with ACM SIGMOD, 1996.

[GEW96] Keith Golden, Oren Etzioni, and Dan Weld. “Planning with Execution and Incomplete Information”. Technical Report UW-CSE-96-01-09, Department of Computer Science and Engineering, University of Washington, Seattle, February 1996.

[GF92] Michael R. Genesereth and Richard E. Fikes. “Knowledge Interchange Format, Version 3.0 Reference Manual”. Technical Report Logic-92-1, Computer Science Department, Stanford University, 1992.

[GG95] Nicola Guarino and Pierdaniele Giaretta. “Ontologies and Knowledge Bases: Towards a Terminological Clarification”. In N. J. I. Mars, editor, Towards Very Large Knowledge Bases. IOS Press, 1995.

[GHB99] Mark Greaves, Heather Holmback, and Jeffrey M. Bradshaw. “What is a Conversation Policy?”. In Mark Greaves and Jeffrey M. Bradshaw, editors, Proceedings of the Autonomous Agents’99 Workshop on Specifying and Implementing Conversation Policies, pages 1–9, Seattle, Washington, May 1999.

[GHJV94] E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design Patterns. Elements of Reusable Object-Oriented Software. Addison Wesley Professional Computing Series, October 1994.

[GJ79] Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman & Co., 1979.

[GK94] Michael R. Genesereth and Steven P. Ketchpel. “Software Agents”. Communications of the ACM, 37(7):48–53, 1994.

[GKD97] Michael R. Genesereth, Arthur M. Keller, and Oliver M. Duschka. “Infomaster: An Information Integration System”. In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data (SIGMOD’97), pages 539–542, 1997.

[GMLY98] Hector Garcia-Molina, Wilburt Labio, and Jun Yang. “Expiring Data in a Warehouse”. In Proceedings of the 1998 International Conference on Very Large Data Bases (VLDB’98), 1998. Extended version as Technical Report 1998-35, Stanford Database Group.

[GMPQ+97] Hector Garcia-Molina, Yannis Papakonstantinou, Dallan Quass, Anand Rajaraman, Yehoshua Sagiv, Jeffrey D. Ullman, Vasilis Vassalos, and Jennifer Widom. “The TSIMMIS Approach to Mediation: Data Models and Languages”. Journal of Intelligent Information Systems, 8(2):117–132, 1997.

[GN87] Michael R. Genesereth and Nils J. Nilsson. Logical Foundations of Artificial Intelligence. Morgan Kaufmann Publishers, 1987.

[Gru] Thomas R. Gruber. “What is an Ontology?”. http://www-ksl.stanford.edu/kst/what-is-an-ontology.html.

[Gru92] Thomas R. Gruber. “Ontolingua: A Mechanism to Support Portable Ontologies”. Technical Report KSL-91-66, Stanford University, Knowledge Systems Laboratory, March 1992.

[Gru93a] Thomas R. Gruber. “A Translation Approach to Portable Ontology Specifications”. Technical Report KSL-92-71, Stanford University, Knowledge Systems Laboratory, April 1993.

[Gru93b] Thomas R. Gruber. “Toward Principles for the Design of Ontologies Used for Knowledge Sharing”. Technical Report KSL 93-04, Knowledge Systems Laboratory, Stanford University, 1993.

[Gua94] Nicola Guarino. “The Ontological Level”. In R. Casati, B. Smith, and G. White, editors, Philosophy and the Cognitive Sciences, Vienna. Hölder-Pichler-Tempsky, 1994. Invited paper presented at IV Wittgenstein Symposium, Kirchberg, Austria, 1993.

[Gua97] Nicola Guarino. “Understanding, Building, and Using Ontologies. A Commentary to ‘Using Explicit Ontologies in KBS Development’, by van Heijst, Schreiber, and Wielinga”. International Journal of Human and Computer Studies, 46(2/3):293–310, 1997.

[GW00a] Nicola Guarino and Christopher A. Welty. “Identity, Unity, and Individuality: Towards a Formal Toolkit for Ontological Analysis”. In Proceedings of the European Conference on Artificial Intelligence (ECAI-2000). IOS Press, August 2000.

[GW00b] Nicola Guarino and Christopher A. Welty. “Ontological Analysis of Taxonomic Relationships”. In International Conference on Conceptual Modeling (ER 2000), pages 210–224, 2000.

[Hal00] Alon Y. Halevy. “Theory of Answering Queries Using Views”. SIGMOD Record, 29(4), December 2000.

[HGB99] Heather Holmback, Mark Greaves, and Jeffrey Bradshaw. “Agent A, Can You Pass the Salt? The Role of Pragmatics in Agent Communications”, May 1999. Submitted to Autonomous Agents’99.

[HK93] Chun-Nan Hsu and Craig A. Knoblock. “Reformulating Query Plans for Multidatabase Systems”. In Proc. of the Second International Conference on Information and Knowledge Management (CIKM’93), pages 423–432, Washington, DC, 1993.

[HM85] Dennis Heimbigner and Dennis McLeod. “A Federated Architecture for Information Management”. ACM Transactions on Office Information Systems, 3(3):253–278, July 1985.

[HM00] Volker Haarslev and Ralf Möller. “Expressive ABox Reasoning with Number Restrictions, Role Hierarchies, and Transitively Closed Roles”. In Fausto Giunchiglia and Bart Selman, editors, Proceedings of Seventh International Conference on Principles of Knowledge Representation and Reasoning (KR’2000), Breckenridge, Colorado, USA, April 2000.

[Hor98] Ian Horrocks. “Using an Expressive Description Logic: FaCT or Fiction?”. In A. G. Cohn, L. Schubert, and S. C. Shapiro, editors, Principles of Knowledge Representation and Reasoning: Proceedings of the Sixth International Conference (KR’98), pages 636–647. Morgan Kaufmann Publishers, June 1998.

[HS97] M. Huhns and M. P. Singh. “Ontologies for Agents”. E-commerce, IEEE Internet Computing, 1(6):81–83, November–December 1997.

[HU79] John E. Hopcroft and Jeffrey D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley Publishing Company, Reading, Massachusetts, 1979.

[JCL+96] N. R. Jennings, J. Corera, I. Laresgoiti, E. H. Mamdani, F. Perriolat, P. Skarek, and L. Z. Varga. “Using ARCHON to Develop Real-world DAI Applications for Electricity Transportation Management and Particle Accelerator Control”. IEEE Expert, 11(6), 1996.

[Jen99] Nicholas R. Jennings. “Agent-based Computing: Promise and Perils”. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI’99), Stockholm, Sweden, 1999. Morgan Kaufmann Publishers.

[JFJ+96] N. R. Jennings, P. Faratin, M. J. Johnson, T. J. Norman, P. O’Brien, and M. E. Wiegand. “Agent-based Business Process Management”. International Journal of Cooperative Information Systems, 5(2 and 3):105–130, 1996.

[JFN+00] N. R. Jennings, P. Faratin, T. J. Norman, P. O’Brien, B. Odgers, and J. L. Alty. “Implementing a Business Process Management System using ADEPT: A Real-World Case Study”. International Journal of Applied Artificial Intelligence, 14(3), 2000.

[JGJ+95] M. Jarke, R. Gallersdörfer, M.A. Jeusfeld, M. Staudt, and S. Eherer. “ConceptBase – A Deductive Object Base for Meta Data Management”. Journal of Intelligent Information Systems, Special Issue on Advances in Deductive Object-Oriented Databases, 4(2):167–192, 1995.

[JLVV00] Matthias Jarke, Maurizio Lenzerini, Yannis Vassiliou, and Panos Vassiliadis. Fundamentals of Data Warehouses. Springer-Verlag, 2000.

[JNF98] N. R. Jennings, T. J. Norman, and P. Faratin. “ADEPT: An Agent-based Approach to Business Process Management”. ACM SIGMOD Record, 27(4):32–39, 1998.

[Joh90] David S. Johnson. “A Catalog of Complexity Classes”. In Jan van Leeuwen, editor, Handbook of Theoretical Computer Science, volume 1, chapter 2, pages 67–161. Elsevier Science Publishers B.V., 1990.

[JW00] Nicholas R. Jennings and Michael Wooldridge. “Agent-Oriented Software Engineering”. In Jeffrey Bradshaw, editor, Handbook of Agent Technology. AAAI/MIT Press, 2000.

[Kan90] Paris C. Kanellakis. “Elements of Relational Database Theory”. In Jan van Leeuwen, editor, Handbook of Theoretical Computer Science, volume 2, chapter 17, pages 1074–1156. Elsevier Science Publishers B.V., 1990.

[KDB98] Anthony Kosky, Susan Davidson, and Peter Buneman. “Semantics of Database Transformations”. In L. Libkin and B. Thalheim, editors, Semantics of Databases. Springer LNCS 1358, February 1998.

[Kim95] Won Kim, editor. Modern Database Systems: The Object Model, Interoperability, and Beyond. Addison-Wesley, 1995.

[KJ99] S. Kalenka and N. R. Jennings. “Socially Responsible Decision Making by Autonomous Agents”. In K. Korta, E. Sosa, and X. Arrazola, editors, Cognition, Agency and Rationality, pages 135–149. Kluwer, 1999.

[KL89] Michael Kifer and Georg Lausen. “F-Logic: A Higher-Order Language for Reasoning about Objects, Inheritance, and Scheme”. In Proceedings of the 1989 ACM SIGMOD International Conference on Management of Data (SIGMOD’89), pages 134–146, Portland, OR, USA, 1989.

[Klu88] Anthony Klug. “On Conjunctive Queries Containing Inequalities”. Journal of the ACM, 35(1):146–160, January 1988.

[KS92] Henry Kautz and Bart Selman. “Planning as Satisfiability”. In Proceedings of the 10th European Conference on Artificial Intelligence (ECAI’92), Vienna, August 1992.

[KW96] Chung T. Kwok and Daniel S. Weld. “Planning to Gather Information”. In Proc. AAAI’96, Portland, OR, August 1996.

[Lev00] Alon Y. Levy. “Answering Queries Using Views: A Survey”, 2000. Submitted for publication.

[LGP+90] D. B. Lenat, R. V. Guha, K. Pittman, D. Pratt, and M. Shepherd. “Cyc: Toward Programs with Common Sense”. Communications of the ACM, 33(8):30–49, 1990.

[LHC] http://lhc.web.cern.ch/lhc/.

[LMSS95] Alon Y. Levy, Alberto O. Mendelzon, Yehoshua Sagiv, and Divesh Srivastava. “Answering Queries Using Views”. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS) 1995, San Jose, CA, 1995.

[LR96] Alon Y. Levy and Marie-Christine Rousset. “CARIN: A Representation Language Combining Horn Rules and Description Logics”. In Proc. 12th European Conference on Artificial Intelligence, 1996.

[LRO96] Alon Y. Levy, Anand Rajaraman, and Joann J. Ordille. “Querying Heterogeneous Information Sources Using Source Descriptions”. In Proceedings of the 1996 International Conference on Very Large Data Bases (VLDB’96), pages 251–262, 1996.

[LRV88] Christophe Lécluse, Philippe Richard, and Fernando Velez. “O2, an Object-oriented Data Model”. In Proceedings of the 1988 ACM SIGMOD International Conference on Management of Data (SIGMOD’88), pages 424–433, Chicago, IL, USA, June 1988.

[LS97] Alon Y. Levy and Dan Suciu. “Deciding Containment for Queries with Complex Objects (Extended Abstract)”. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS’97), pages 20–31, Tucson, AZ, May 1997.

[LSS99] Laks V. S. Lakshmanan, Fereidoon Sadri, and Subbu N. Subrahmanian. “On Efficiently Implementing SchemaSQL on a SQL Database System”. In Proceedings of the 25th International Conference on Very Large Data Bases (VLDB’99), Edinburgh, Scotland, 1999.

[Mae94] Pattie Maes. “Agents that Reduce Work and Information Overload”. Communications of the ACM, 37(7), July 1994.

[MB87] R. MacGregor and R. Bates. “The LOOM Knowledge Representation Language”. Technical Report ISI/RS-87-188, USC/ISI, 1987.

[MHH+01] R. Miller, M. Hernandez, L. Haas, L. Yan, C. Ho, R. Fagin, and L. Popa. “The Clio Project: Managing Heterogeneity”. SIGMOD Record, 30(1), March 2001.

[MIKS00] E. Mena, A. Illarramendi, V. Kashyap, and A. Sheth. “OBSERVER: An Approach for Query Processing in Global Information Systems based on Interoperation across Pre-existing Ontologies”. International Journal of Distributed and Parallel Databases (DAPD), 8(2):223–271, 2000.

[MKSI96] Eduardo Mena, Vipul Kashyap, Amit P. Sheth, and Arantza Illarramendi. “OBSERVER: An Approach for Query Processing in Global Information Systems based on Interoperation across Pre-existing Ontologies”. In Proceedings of the First IFCIS International Conference on Cooperative Information Systems (CoopIS’96), pages 14–25, Brussels, Belgium, June 1996. IEEE Computer Society Press.

[MKW00] Prasenjit Mitra, Martin Kersten, and Gio Wiederhold. “A Graph-Oriented Model for Articulation of Ontology Interdependencies”. In Proceedings of the 7th International Conference on Extending Database Technology (EDBT 2000), Konstanz, Germany, March 2000. Springer-Verlag.

[MLF00] Todd Millstein, Alon Levy, and Marc Friedman. “Query Containment for Data Integration Systems”. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS) 2000, Dallas, Texas, May 2000.

[MMS79] David Maier, Alberto O. Mendelzon, and Yehoshua Sagiv. “Testing Implications of Data Dependencies”. ACM Transactions on Database Systems, 4(4):455–469, 1979.

[MY95] Weiyi Meng and Clement Yu. “Query Processing in Multidatabase Systems”. In Won Kim, editor, Modern Database Systems: The Object Model, Interoperability, and Beyond, pages 551–572. Addison-Wesley, 1995.

[MZ98] Tova Milo and Sagit Zohar. “Using Schema Matching to Simplify Heterogeneous Data Translation”. In Proceedings of the 1998 International Conference on Very Large Data Bases (VLDB’98), August 1998.

[NBN99] M. Nodine, W. Bohrer, and A. Ngu. “Semantic Brokering over Dynamic Heterogeneous Data Sources in InfoSleuth”. In Proceedings of the 15th IEEE International Conference on Data Engineering (ICDE’99), 1999.

[Neb89] Bernhard Nebel. “What is Hybrid in Hybrid Representation Systems?”. In F. Gardin, G. Mauri, and M. G. Filippini, editors, Proceedings of the International Symposium on Computational Intelligence’89, pages 217–228, Amsterdam, The Netherlands, 1989. North-Holland.

[New82] Allen Newell. “The Knowledge Level”. Artificial Intelligence, 18:87–127, 1982.

[New93] Allen Newell. “Reflections on the Knowledge Level”. Artificial Intelligence, 59:31–38, 1993.

[NPU98] M. Nodine, P. Perry, and A. Unruh. “Experience with the InfoSleuth Agent Architecture”. In Proceedings of the AAAI-98 Workshop on Software Tools for Developing Agents, 1998.

[NU97] M. Nodine and A. Unruh. “Facilitating Open Communication in Agent Systems: the InfoSleuth Infrastructure”. In Proceedings of ATAL-97, 1997.

[NvL88] Bernhard Nebel and Kai von Luck. “Hybrid Reasoning in BACK”. In Z. W. Ras and L. Saitta, editors, Proceedings of the Third International Symposium on Methodologies for Intelligent Systems, pages 260–269, Amsterdam, The Netherlands, 1988. North-Holland.

[Nwa96] Hyacinth S. Nwana. “Software Agents: An Overview”. Knowledge Engineering Review, 11(3):1–40, September 1996.

[OV99] M. Tamer Özsu and Patrick Valduriez. Principles of Distributed Database Systems. Prentice Hall, 1999.

[Pap94] Christos H. Papadimitriou. Computational Complexity. Addison-Wesley, 1994.

[PGMW95] Yannis Papakonstantinou, Hector Garcia-Molina, and Jennifer Widom. “Object Exchange Across Heterogeneous Information Systems”. In Proceedings of the 11th IEEE International Conference on Data Engineering (ICDE’95), March 1995.

[PGR98] C. Petrie, S. Goldmann, and A. Raquet. “Agent-Based Process Management”. In Proc. of the International Workshop on Intelligent Agents in CSCW, Deutsche Telekom, Dortmund, pages 1–17, September 1998.

[PHG+99] A. Preece, K. Hui, A. Gray, P. Marti, T. Bench-Capon, D. Jones, and Z. Cui. “The KRAFT Architecture for Knowledge Fusion and Transformation”. In Proceedings of the Nineteenth SGES International Conference on Knowledge Based Systems and Applied Artificial Intelligence (ES’99), Cambridge, UK, 1999.

[PL00] Rachel Pottinger and Alon Y. Levy. “A Scalable Algorithm for Answering Queries Using Views”. In Proceedings of the 26th International Conference on Very Large Data Bases (VLDB’2000), 2000.

[PSS93] Peter F. Patel-Schneider and William Swartout. “Description Logic Knowledge Representation System Specification from the KRSS Group of the ARPA Knowledge Sharing Effort”, November 1993.

[PV99] Yannis Papakonstantinou and Vasilis Vassalos. “Query Rewriting for Semistructured Data”. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (SIGMOD’99), 1999.

[PW92] J. S. Penberthy and D. Weld. “UCPOP: A Sound, Complete, Partial-Order Planner for ADL”. In Third International Conference on Knowledge Representation and Reasoning (KR-92), Cambridge, MA, October 1992.

[PWC95] C. Petrie, T. Webster, and M. Cutkosky. “Using Pareto Optimality to Coordinate Distributed Agents”. AIEDAM, 9:269–281, 1995.

[Qia96] Xiaolei Qian. “Query Folding”. In Proceedings of the 12th IEEE International Conference on Data Engineering (ICDE’96), pages 48–55, New Orleans, LA, 1996.

[RN95] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, NJ, 1995.

[Ros99] Riccardo Rosati. “Towards Expressive KR Systems Integrating Datalog and Description Logics: Preliminary Report”. In Proc. DL’99, 1999.

[RS97] Mary Tork Roth and Peter Schwarz. “Don’t Scrap It, Wrap It! A Wrapper Architecture for Legacy Data Sources”. In Proceedings of the 1997 International Conference on Very Large Data Bases (VLDB’97), 1997.

[RSU95] Anand Rajaraman, Yehoshua Sagiv, and Jeffrey D. Ullman. “Answering Queries Using Templates with Binding Patterns”. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS) 1995, pages 105–112, 1995.

[RVW99] C. M. Rood, D. Van Gucht, and F. I. Wyss. “MD-SQL: A Language for Meta-data Queries over Relational Databases”. Technical Report TR528, Dept. of CS, Indiana University, 1999.

[RZA95] Paul Resnick, Richard Zeckhauser, and Chris Avery. “Roles for Electronic Brokers”. In G. W. Brock, editor, Toward a Competitive Telecommunication Industry, pages 289–304. Lawrence Erlbaum Associates, Mahwah, NJ, 1995.

[Sar91] Y. Saraiya. Subtree Elimination Algorithms in Deductive Databases. PhD thesis, Department of Computer Science, Stanford University, January 1991.

[SCB+98] I. A. Smith, P. R. Cohen, J. M. Bradshaw, M. Greaves, and H. Holmback. “Designing Conversation Policies using Joint Intention Theory”. In Proc. International Joint Conference on Multi-Agent Systems (ICMAS-98), Paris, France, July 1998.

[SCH+97] Munindar P. Singh, Philip Cannata, Michael N. Huhns, Nigel Jacobs, Tomasz Ksiezyk, KayLiang Ong, Amit P. Sheth, Christine Tomlinson, and Darrell Woelk. “The Carnot Heterogeneous Database Project: Implemented Applications”. Distributed and Parallel Databases, 5(2):207–225, 1997.

[SDJL96] Divesh Srivastava, Shaul Dar, H. V. Jagadish, and Alon Y. Levy. “Answering Queries with Aggregation Using Views”. In Proceedings of the 1996 International Conference on Very Large Data Bases (VLDB’96), pages 318–329, 1996.

[Sea69] John R. Searle. Speech Acts: An Essay in the Philosophy of Language. Cambridge University Press, Cambridge, 1969.

[SGV99] W. Swartout, Y. Gil, and A. Valente. “Representing Capabilities of Problem Solving Methods”. In Proc. IJCAI-99 Workshop on Ontologies and Problem-Solving Methods: Lessons Learned and Future Trends, Stockholm, Sweden, August 1999.

[Shm87] Oded Shmueli. “Decidability and Expressiveness Aspects of Logic Queries”. In Proceedings of the 6th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS’87), pages 237–249, 1987.

[Sho93] Yoav Shoham. “Agent-Oriented Programming”. Artificial Intelligence, 60(1):51–92, 1993.

[SHWK76] Michael Stonebraker, Gerald Held, Eugene Wong, and Peter Kreps. “The Design and Implementation of INGRES”. ACM Transactions on Database Systems, 1(3):189–222, 1976.

[Sip97] Michael F. Sipser. Introduction to the Theory of Computation. PWS Publishing, 1997.

[SL90] Amit P. Sheth and James A. Larson. “Federated Database Systems for Managing Distributed, Heterogeneous and Autonomous Databases”. ACM Computing Surveys, 22(3), September 1990.

[SL95] Tuomas Sandholm and Victor Lesser. “Issues in Automated Negotiation and Electronic Commerce: Extending the Contract Net Framework”. In 1st International Conference on Multiagent Systems (ICMAS), pages 328–335, San Francisco, 1995.

[SLK98] Katia Sycara, J. Lu, and Matthias Klusch. “Interoperability among Heterogeneous Software Agents on the Internet”. Technical report, Carnegie Mellon University, Pittsburgh, USA, 1998.

[Smi80] Reid G. Smith. “The Contract Net Protocol: High-Level Communication and Control in a Distributed Problem Solver”. IEEE Transactions on Computers, 29(12):1104–1113, December 1980.

[SPVG01] K. Sycara, M. Paolucci, M. Van Velsen, and J. A. Giampapa. “The RETSINA MAS Infrastructure”. Technical Report CMU-RI-TR-01-05, Robotics Institute, Carnegie Mellon University, March 2001.

[SS89] Manfred Schmidt-Schauss. “Subsumption in KL-ONE is Undecidable”. In Proceedings of the 1st International Conference on Principles of Knowledge Representation and Reasoning (KR’89), pages 421–431. Morgan Kaufmann, 1989.

[SSS91] Manfred Schmidt-Schauss and Gert Smolka. “Attributive Concept Descriptions with Complements”. Artificial Intelligence, 48(1):1–26, 1991.

[SY80] Yehoshua Sagiv and Mihalis Yannakakis. “Equivalences Among Relational Expressions with the Union and Difference Operators”. Journal of the ACM, 27(4):633–655, 1980.

[TBM99] P. Tsompanopoulou, L. Bölöni, and D. C. Marinescu. “The Design of Software Agents for a Network of PDE Solvers”. In Proc. Workshop on Autonomous Agents in Scientific Computing at Autonomous Agents 1999, pages 57–68, 1999.

[TK78] D. Tsichritzis and A. Klug. “The ANSI/X3/SPARC DBMS Framework”. Information Systems, 3(4), 1978.

[TMD92] J. Thierry-Mieg and R. Durbin. “Syntactic Definitions for the ACeDB Data Base Manager”, 1992.

[TSI94] Odysseas G. Tsatalos, Marvin H. Solomon, and Yannis E. Ioannidis. “The GMAP: A Versatile Tool for Physical Data Independence”. In Proceedings of the 1994 International Conference on Very Large Data Bases (VLDB’94), 1994.

[TYF86] Toby J. Teorey, Dongqing Yang, and James P. Fry. “A Logical Design Methodology for Relational Databases using the Extended Entity-Relationship Model”. ACM Computing Surveys, 18(2):197–222, 1986.

[Ull88] Jeffrey D. Ullman. Principles of Database & Knowledge-Base Systems, Vol. 1. Computer Science Press, December 1988.

[Ull89] Jeffrey D. Ullman. Principles of Database & Knowledge-Base Systems, Vol. 2: The New Technologies. Computer Science Press, 1989.

[Ull97] Jeffrey D. Ullman. “Information Integration Using Logical Views”. In Proc. ICDT’97, pages 19–40, 1997.

[Var82] Moshe Y. Vardi. “The Complexity of Relational Query Languages”. In Proc. 14th Annual ACM Symposium on Theory of Computing (STOC’82), pages 137–146, San Francisco, CA, May 1982.

[Var97] Moshe Y. Vardi. “Why is Modal Logic so Robustly Decidable”. In DIMACS Series in Discrete Mathematics and Theoretical Computer Science 31, American Math. Society, pages 149–184, 1997.

[vdM92] Ron van der Meyden. “The Complexity of Querying Indefinite Information about Linearly Ordered Domains”. In Proceedings of the 11th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS’92), pages 331–345, San Diego, June 1992. ACM Press.

[vLNPS87] K. von Luck, B. Nebel, C. Peltason, and A. Schmiedel. “The Anatomy of the BACK System”. Technical Report 41, KIT (Künstliche Intelligenz und Textverstehen), Technical University of Berlin, January 1987.

[VV98] Sergei Vorobyov and Andrei Voronkov. “Complexity of Nonrecursive Logic Programs with Complex Values”. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS) 1998, 1998.

[WBLX00] T. Wagner, B. Benyo, V. Lesser, and P. Xuan. “Investigating Interactions Between Agent Conversations and Agent Control Components”. In Frank Dignum and Mark Greaves, editors, Issues in Agent Communication, Lecture Notes in Computer Science. Springer-Verlag, Berlin, April 2000.

[Wei99] Gerhard Weiss. Multiagent Systems: A Modern Approach to Distributed Artificial Intelligence. MIT Press, 1999.

[Wel99] Daniel S. Weld. “Recent Advances in AI Planning”. AI Magazine, 20(2):93–123, 1999.

[Wid96] Jennifer Widom. “Integrating Heterogeneous Databases: Lazy or Eager?”. ACM Computing Surveys, 28A(4), December 1996.

[Wie92] Gio Wiederhold. “Mediators in the Architecture of Future Information Systems”. IEEE Computer, 25(3):38–49, March 1992.

[Wie96] Gio Wiederhold, editor. Intelligent Integration of Information. Kluwer Academic Publishers, Boston, July 1996.

[WJ95] Michael J. Wooldridge and Nicholas R. Jennings. “Intelligent Agents: Theory and Practice”. Knowledge Engineering Review, 10(2), June 1995.

[Wor01] World Wide Web Consortium. Semantic Web Activity Home Page, 2001. http://www.w3.org/2001/sw/.

[WT98] Gerhard Wickler and Austin Tate. “Capability Representations for Brokering: A Survey”, November 1998. Available at http://www.aiai.ed.ac.uk/~oplan/cdl/.

[YL87] H. Z. Yang and Per-Åke Larson. “Query Transformation for PSJ-Queries”. In Proceedings of the 13th International Conference on Very Large Data Bases (VLDB’87), pages 245–254, Brighton, England, 1987.

[YO79] C. T. Yu and M. Özsoyoglu. “An Algorithm for Tree-Query Membership of a Distributed Query”. In Proc. IEEE COMPSAC, pages 306–312, 1979.

[Zan96] Carlo Zaniolo. “A Short Overview of LDL++: A Second-Generation Deductive Database System”. Computational Logic, 3(1):87–93, December 1996.

[ZHKF95a] Gang Zhou, Richard Hull, Roger King, and Jean-Claude Franchitti. “Supporting Data Integration and Warehousing Using H2O”. IEEE Data Engineering, 18(2):29–40, June 1995.

[ZHKF95b] Gang Zhou, Richard Hull, Roger King, and Jean-Claude Franchitti. “Using Object Matching and Materialization to Integrate Heterogeneous Databases”. In S. Laufmann, S. Spaccapietra, and T. Yokoi, editors, Proc. of the 3rd Int. Conf. on Cooperative Information Systems (CoopIS’95), pages 4–18, Vienna, Austria, May 1995.
