
Inverting Schema Mappings

RONALD FAGIN

IBM Almaden Research Center

A schema mapping is a specification that describes how data structured under one schema (the source schema) is to be transformed into data structured under a different schema (the target schema). Although the notion of an inverse of a schema mapping is important, the exact definition of an inverse mapping is somewhat elusive. This is because a schema mapping may associate many target instances with each source instance, and many source instances with each target instance. Based on the notion that the composition of a mapping and its inverse is the identity, we give a formal definition for what it means for a schema mapping $M'$ to be an inverse of a schema mapping $M$ for a class $\mathcal{S}$ of source instances. We call such an inverse an $\mathcal{S}$-inverse. A particular case of interest arises when $\mathcal{S}$ is the class of all source instances, in which case an $\mathcal{S}$-inverse is a global inverse. We focus on the important and practical case of schema mappings specified by source-to-target tuple-generating dependencies, and uncover a rich theory. When $\mathcal{S}$ is specified by a set of dependencies with a finite chase, we show how to construct an $\mathcal{S}$-inverse when one exists. In particular, we show how to construct a global inverse when one exists. Given $M$ and $M'$, we show how to define the largest class $\mathcal{S}$ such that $M'$ is an $\mathcal{S}$-inverse of $M$.

Categories and Subject Descriptors: H.2.5 [Database Management]: Heterogeneous Databases - Data translation; H.2.4 [Database Management]: Systems - Relational databases

General Terms: Algorithms, Theory

Additional Key Words and Phrases: Data exchange, inverse, schema mapping, data integration, chase, computational complexity, dependencies, metadata model management, second-order logic

ACM Reference Format:

Fagin, R. 2007. Inverting schema mappings. ACM Trans. Datab. Syst. 32, 4, Article 25 (November 2007), 53 pages. DOI = 10.1145/1292609.1292615 http://doi.acm.org/10.1145/1292609.1292615

1. INTRODUCTION

Data exchange is the problem of materializing an instance that adheres to a target schema, given an instance of a source schema and a schema mapping that specifies the relationship between the source and the target. This is a very old problem [Shu et al. 1977] that arises in many tasks where data must be transferred between independent applications that do not have the same data format.

This is an expanded version of Fagin [2006]. Author's address: IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120; email: fagin@almaden.ibm.com.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or permissions@acm.org.

© 2007 ACM 0362-5915/2007/11-ART25 $5.00 DOI 10.1145/1292609.1292615 http://doi.acm.org/10.1145/1292609.1292615

ACM Transactions on Database Systems, Vol. 32, No. 4, Article 25, Publication date: November 2007.

Because of the extensive use of schema mappings, it has become important to develop a framework for managing schema mappings and other metadata, and operators for manipulating them. Bernstein [2003] has introduced such a framework, called model management. Melnik et al. [2005] have developed a semantics for model-management operators that allows applying the operators to executable mappings. One important schema mapping operator, at least in principle, is the inverse operator. What do we mean by an inverse of a schema mapping? This is a delicate question, since in spite of the traditional use of the name "mapping," a schema mapping is not simply a function that maps an instance of the source schema to an instance of the target schema. Instead, for each source instance, the schema mapping may associate many target instances. Furthermore, for each target instance, there may be many corresponding source instances.

As in Fagin et al. [2005a, 2005b, 2005c], we study the relational case where a schema is a sequence of distinct relational symbols. A schema mapping is a triple $M = (\mathbf{S}, \mathbf{T}, \Sigma)$, where $\mathbf{S}$ (the source schema) and $\mathbf{T}$ (the target schema) are sequences of distinct relation symbols with no relation symbols in common and $\Sigma$ is a set of formulas of some logical formalism over $\langle \mathbf{S}, \mathbf{T} \rangle$. We say that $\Sigma$ specifies the schema mapping $M$. As in Fagin et al. [2005a, 2005b, 2005c], our main focus is on the important and practical case of schema mappings where $\Sigma$ is a finite set of source-to-target tuple-generating dependencies (which we shall call s-t tgds or simply tgds). These are formulas of the form $\forall \mathbf{x}\, (\varphi(\mathbf{x}) \rightarrow \exists \mathbf{y}\, \psi(\mathbf{x}, \mathbf{y}))$, where $\varphi(\mathbf{x})$ is a conjunction of atoms¹ over $\mathbf{S}$, and where $\psi(\mathbf{x}, \mathbf{y})$ is a conjunction of atoms over $\mathbf{T}$.²
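The satisfaction relation for s-t tgds can be made concrete with a small sketch. Everything below is an illustrative assumption rather than notation from the article: instances are encoded as dictionaries from relation names to sets of tuples, an atom is a pair of a relation name and a tuple of variable names, and the helper names `matches` and `satisfies` are hypothetical. Variables that occur only in the conclusion play the role of the existentially quantified $\mathbf{y}$.

```python
def matches(atoms, instance, binding=None):
    """Yield each variable binding that maps every atom to a fact of `instance`."""
    binding = dict(binding or {})
    if not atoms:
        yield binding
        return
    (relation, variables), rest = atoms[0], atoms[1:]
    for fact in instance.get(relation, set()):
        if len(fact) != len(variables):
            continue
        extended = dict(binding)
        # setdefault binds a fresh variable, or returns the value already bound.
        if all(extended.setdefault(v, c) == c for v, c in zip(variables, fact)):
            yield from matches(rest, instance, extended)


def satisfies(source, target, tgd):
    """Check that <source, target> satisfies forall x (phi(x) -> exists y psi(x, y))."""
    premise, conclusion = tgd
    # Every way of matching the premise in the source must extend to a
    # match of the conclusion in the target.
    return all(
        next(matches(conclusion, target, binding), None) is not None
        for binding in matches(premise, source)
    )


# Example tgd (an assumption for illustration): P(x, y) -> exists z R(x, y, z)
tgd = ([("P", ("x", "y"))], [("R", ("x", "y", "z"))])
source = {"P": {(1, 2)}}
with_witness = {"R": {(1, 2, "null_1")}}
without_witness = {"R": set()}

print(satisfies(source, with_witness, tgd))
print(satisfies(source, without_witness, tgd))
```

The sketch checks a single tgd; a finite set $\Sigma$ would be handled by requiring `satisfies` to hold for each tgd in the set.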

They have been used to formalize data exchange [Fagin et al. 2005a]. They have also been used in data integration scenarios under the name of GLAV (global-and-local-as-view) assertions [Lenzerini 2002]. Note that tgds do not contain equality, or any other "built-in relation symbols." When we consider egds (equality-generating dependencies), we shall of course treat equality as a built-in relation symbol that appears in the conclusion. Later (in Section 15), we shall extend the language of tgds so that the premise may include inequalities, and also a relation symbol Constant that represents constants.

Intuitively, we would expect invertibility of a schema mapping to correspond to "no loss of information." As an example, assume that the source schema has only the binary relation symbol $P$, and the target schema has only the unary relation symbol $Q$. Consider the projection schema mapping that is specified by the s-t tgd $P(x, y) \rightarrow Q(x)$.³
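A minimal sketch of the projection example follows, under the assumption that an instance of $P$ is encoded as a set of pairs and an instance of $Q$ as a set of 1-tuples; the function names and the sample instances are hypothetical. It compares two different source instances against a handful of candidate targets and finds that the projection tgd cannot tell them apart.

```python
def is_solution(p, q):
    """True if the pair <P, Q> satisfies the tgd P(x, y) -> Q(x)."""
    return all((x,) in q for (x, _y) in p)


def minimal_solution(p):
    """The smallest Q that satisfies P(x, y) -> Q(x) for the given P."""
    return {(x,) for (x, _y) in p}


# Two different source instances that agree on the first column of P.
source_a = {(1, 2)}
source_b = {(1, 3)}

candidates = [set(), {(1,)}, {(2,)}, {(1,), (2,)}]
same_targets = all(
    is_solution(source_a, q) == is_solution(source_b, q) for q in candidates
)

print(source_a != source_b)
print(minimal_solution(source_a) == minimal_solution(source_b))
print(same_targets)
```

The second column of $P$ never reaches the target, so no target instance can distinguish `source_a` from `source_b`.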

It is clear that information is lost by this mapping, and, indeed, the projection schema mapping turns out not to have an inverse. Now assume that the source schema has only the binary relation symbol $P$, and the target schema has only the ternary relation symbol $R$. Consider the schema mapping that is specified by the s-t tgd $P(x, y) \rightarrow \exists z\, R(x, y, z)$. It is clear that no information is lost by this mapping, and indeed, this schema mapping turns out to have an inverse. One such inverse is specified by the tgd that results by "reversing the arrows," namely, $R(x, y, z) \rightarrow P(x, y)$. However, it turns out that "reversing the arrows" does not always produce an inverse, even when one exists.

¹ An atom over $\mathbf{S}$ is a formula of the form $P(v_1, \ldots, v_m)$, where $P$ is a relation symbol of $\mathbf{S}$, and $v_1, \ldots, v_m$ are variables; similarly, we define an atom over $\mathbf{T}$.

² There is also a safety condition, which says that every variable in $\mathbf{x}$ appears in $\varphi$. However, not all of the variables in $\mathbf{x}$ need to appear in $\psi$.

³ We will often drop the universal quantifiers in front of a tgd, and implicitly assume such quantification. However, we will write down all existential quantifiers.

There are other flavors of "schema mappings" that have been studied in the literature, such as view definitions, where there is a unique target instance associated with each source instance. In such cases, a schema mapping is a function in the classical sense, and so it is quite clear and unambiguous as to what an inverse mapping is. An example of such work is Hull's [1986] seminal research on information capacity of relational database schemas. Although our schema mappings are not actually functions, they have the advantage of being simpler and more flexible. LAV (local-as-view) mappings, which have been widely used in data integration, are special cases of schema mappings specified by s-t tgds, where we simply add the restriction that the premise of each tgd must be a single atom rather than a conjunction of atoms.

Let us now consider how to define the inverse in our context, where schema mappings are not actually functions. Let us associate with the schema mapping $M_{12} = (\mathbf{S}_1, \mathbf{S}_2, \Sigma_{12})$ the set $S_{12}$ of ordered pairs $\langle I, J \rangle$ such that $I$ is a source instance, $J$ is a target instance, and the pair $\langle I, J \rangle$ satisfies $\Sigma_{12}$ (written $\langle I, J \rangle \models \Sigma_{12}$). Perhaps the most natural definition of the inverse of the schema mapping $M_{12}$ would be a schema mapping $M_{21}$ that is associated with the set

$$S_{21} = \{ \langle J, I \rangle : \langle I, J \rangle \in S_{12} \}$$

This reflects the standard algebraic definition of an inverse, and is the definition that Melnik [2004] and Melnik et al. [2005] gave for the inverse. In those articles, this definition was intended for a generic model management context, where mappings can be defined in a variety of ways, including as view definitions, relational algebra expressions, etc. However, this definition does not make sense in our context. This is because $S_{12}$, by being associated with a schema mapping specified by s-t tgds, is automatically "closed down on the left and closed up on the right." This means that if $\langle I, J \rangle \in S_{12}$ and if $I' \subseteq I$ (that is, $I'$ is a subinstance of $I$) and $J \subseteq J'$, then $\langle I', J' \rangle \in S_{12}$.
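The closure property can be checked on a small example. The sketch below is an illustration under stated assumptions, not part of the article: it hard-codes the single tgd $P(x, y) \rightarrow \exists z\, R(x, y, z)$ from the earlier example, encodes instances of $P$ and $R$ as sets of tuples, and uses hypothetical names throughout.

```python
def satisfies(source_p, target_r):
    """<I, J> satisfies P(x, y) -> exists z R(x, y, z); I and J are sets of tuples."""
    return all(any(fact[:2] == pair for fact in target_r) for pair in source_p)


i_full = {(1, 2), (3, 4)}
j_full = {(1, 2, "a"), (3, 4, "b")}

i_sub = {(1, 2)}                    # a subinstance of i_full
j_sup = j_full | {(5, 6, "c")}      # a superinstance of j_full

# Shrinking the source and growing the target preserves satisfaction.
print(satisfies(i_full, j_full))
print(satisfies(i_sub, j_sup))

# Moving in the opposite directions can break satisfaction.
print(satisfies(i_full | {(7, 8)}, j_full))   # larger source
print(satisfies(i_full, {(1, 2, "a")}))       # smaller target
```

The last two calls show why the set of reversed pairs behaves differently: a pair stays valid when the target side grows or the source side shrinks, but not the other way around.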

However, instead of being closed down on the left and closed up on the right, $S_{21}$ is closed up on the left and closed down on the right. This is inconsistent with a schema mapping that is specified by a set of s-t tgds, which is the case we focus on in this article. In fact, the "language of inverse" (that is, the language needed to specify inverses for schema mappings specified by s-t tgds) turns out, as we shall discuss in Section 15, to be given by a generalization of s-t tgds that are also closed down on the left and closed up on the right.

Our notion of an inverse of a schema mapping is based on another algebraic property of inverses, that the composition of a function with its inverse is the identity mapping. In our context, the identity mapping is specified by tgds that "copy" the source instance to the target instance. Our definition of inverse says that the schema mapping $M_{21}$ is an inverse of the schema mapping $M_{12}$ for the class $\mathcal{S}$ of source instances if the schema mapping specified by their composition is equivalent on $\mathcal{S}$ to the identity mapping. We refer then to $M_{21}$ as an $\mathcal{S}$-inverse of $M_{12}$. When $\mathcal{S}$ is the class of all source instances, then $M_{21}$ is said to be a
