
Generic Model Management:

A Database Infrastructure
for Schema Manipulation

Philip A. Bernstein
Microsoft Research

Sept. 15, 2003 © 2003 Microsoft Corporation 1


Meta Data Management
 Meta data = structural information
 DB schema, interface defn, web site map, form defns, …

[Figure: examples of meta data. Forms; UML; Architecture; ER Diagram (Customer, Order, Product, Scheduled Delivery, Salesperson); Business Rules (Emp.Sal < Emp.Mgr.Sal); Business Process (Update Marketing, Authorize Credit, Order Entry, Bill Customer, Schedule Delivery, Inventory); Table Defns; VB interfaces; C++ interfaces]



Meta Data Problems
 They all involve schemas and mappings
 E.g., data translation between data models

[Figure: a Hierarchical Schema mapped to a Relational Schema.
Hierarchical Schema: PO (PO#, POdate, POLines (Prod#, PName)).
Relational Schema: PurchaseOrder (OrdID, OrderDate); Items (Item#, OrdID, I_Name).]
Such Problems are Pervasive
 Data translation
 Schema evolution & data migration
 XML message translation for e-commerce
 Integrate custom apps with commercial apps
 Data warehouse loading (clean & transform)
 Design tool support (DB, UML, …)
 OO or XML wrapper generation for SQL DB
 Semantic web
Meta Data Solutions
 Solutions strongly resemble each other, but
 usually are problem-specific
 usually are language-specific (SQL, ODMG, UML, XML, RDF, …)
 usually involve a lot of object-at-a-time programming

 Goals
 Generic solutions
 “Set”-at-a-time programming



Model Management
 A generic approach to meta data mgmt
 Model Mgmt operators manipulate
models and mappings as bulk objects
 Their representation is generic
 Operators - Match, Merge, Diff, Compose
 Avoids problem-specific and language-specific solutions
 Avoids object-at-a-time programming
Models and Mappings
A model is a rooted directed graph, which
represents a complex information structure.
[Figure: map1 relates a Relational Schema, Emp (E#, Dept#, Name), to an XSD, Emp (E#, Dept#, Name (First, Last)).]
A mapping is a model that represents a transformation.
Models and Mappings
A model is a rooted directed graph, which
represents a complex information structure.
[Figure: map1 relates a Relational Schema, Emp (E#, Dept#, Name), to an XSD, Emp (E#, Dept#, Name (First, Last)).]
Or it could be a binary table (a morphism).
Model Mgmt Algebra
 map = Match (M1, M2)
 <M3, map13, map23> = Merge(M1, M2, map)
 map3 = Compose(map1, map2)
 <M2, map12> = Diff(M1, map)
 <M2, map12> = ModelGen(M1, metamodel2)

 M2 = Copy(M1)
Outline
 Introduction to Model Management
 Using MM to solve meta data
problems
 Matching anatomy ontologies
 Model merging
 Wrap-up



Categorizing Meta Data Problems
 Model mapping

M1 map12 M2
 Data translation
 XML message translation for e-commerce
 Integrate custom apps with commercial apps
 Data warehouse loading (clean & transform)

 Solution is the Match “operator”


 Really a CAD system for mapping generation



Categorizing Meta Data Problems (2)
 Model integration

[Figure: M1 and M2 are related by map12; map13 relates M1 to M3 and map23 relates M2 to M3]
 View integration
 Data integration

 Solution is the Merge operator


Categorizing Meta Data Problems (3)
 Model and mapping generation

M1 map12 M2
 Design tools (ER → SQL)
 Wrapper generation (SQL → OO or XML)

 Solution is the ModelGen operator


 <M2, map12> = ModelGen(M1, metamodel2)



Categorizing Meta Data Problems (4)
 Change propagation

M1 map12 M2

M1′ map12 M2′


 Schema evolution
 Required maintenance for all meta data problems

 Solution requires the rest of MM algebra



Change Propagation
 Given
 map1 between xsd1 and SQL schema rdb1
 xsd2, a modified version of xsd1
 Produce
 rdb2 to store instances of xsd2
 a mapping between xsd2 and rdb2

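[Figure: xsd1 maps to rdb1 via map1; step 1 produces map2 between xsd1 and xsd2; step 2 produces map3 between xsd2 and rdb1; step 3 produces map4 between xsd2 and rdb3]

1. map2 = Match(xsd1, xsd2)
2. map3 = map2 • map1
3. <map4, rdb3> = Copy(map3)

Now we need to merge Diff(xsd2, map4) into rdb3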
Change Propagation (cont’d)
4. <xsd2′, map5> = Diff(xsd2, map4)
5. <rdb4, map6> = ModelGen(xsd2′, SQL)
6. map7 = map4 • map5 • map6
7. <rdb2, map8, map9> = Merge(rdb3, rdb4, map7)

[Figure: the change propagation diagram extended with xsd2′ (related to xsd2 by map5, step 4), rdb4 (related to xsd2′ by map6, step 5), map7 between rdb3 and rdb4 (step 6), and rdb2 with map8 and map9 (step 7)]
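The first four change propagation steps can be sketched on toy data. This is only an illustration under strong assumptions: models are flattened to sets of element names, mappings are sets of (source, target) pairs, Match is reduced to name equality, and Diff returns only the model part of its result. The element names and the helpers `match`, `compose`, `invert`, and `diff` are hypothetical.

```python
def match(m1, m2):
    """Naive Match: relate elements with identical names (an assumption)."""
    return {(e, e) for e in m1 & m2}


def compose(map_ab, map_bc):
    """Relational composition: (a, c) whenever some b links a to c."""
    return {(a, c) for a, b in map_ab for b2, c in map_bc if b == b2}


def invert(mapping):
    """Swap the direction of every pair in the mapping."""
    return {(t, s) for s, t in mapping}


def diff(model, mapping):
    """Elements of model that do not appear as a source in mapping."""
    return model - {s for s, _ in mapping}


xsd1 = {"Emp", "E#", "Name"}
rdb1 = {"EMP", "ENO", "ENAME"}
map1 = {("Emp", "EMP"), ("E#", "ENO"), ("Name", "ENAME")}
xsd2 = {"Emp", "E#", "Name", "Phone"}       # xsd1 plus one new element

map2 = match(xsd1, xsd2)                    # step 1
map3 = compose(invert(map2), map1)          # step 2, oriented xsd2 -> rdb1
map4 = set(map3)                            # step 3: copy the mapping ...
rdb3 = {t for _, t in map4}                 # ... and the part of rdb1 it reaches
xsd2_prime = diff(xsd2, map4)               # step 4: what still needs a home

print(sorted(xsd2_prime))
```

The explicit `invert` in step 2 is a choice of this sketch: the slides write the composition as map2 • map1 and leave the direction of each mapping implicit. Steps 5 to 7 (ModelGen and Merge) are not covered here.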
Complete Script in Rondo
Operator Definition: PropagateChanges(s1, d1, s1_d1, s2, c, s2_c)
1. s1_s2 = Match(s1, s2);
2. 〈d1′, d1′_d1〉 = Delete(d1, Traverse(All(s1) − Domain(s1_s2), s1_d1));
3. 〈c′, c′_c〉 = Extract(c, Traverse(All(s2) − Range(s1_s2), s2_c));
4. c′_d1′ = c′_c ∗ Invert(s2_c) ∗ Invert(s1_s2) ∗ s1_d1 ∗ Invert(d1′_d1);
5. 〈d2, c′_d2, d1′_d2〉 = Merge(c′, d1′, c′_d1′);
6. s2_d2 = s2_c ∗ Invert(c′_c) ∗ c′_d2 +
Invert(s1_s2) ∗ s1_d1 ∗ Invert(d1′_d1) ∗ d1′_d2;
7. return 〈d2, s2_d2〉;

Operator Use:
SQLXSD: PropagateChanges(s1, d1, s1_d1, s2, ModelGen(s2, XSD));



Status Report
 Previous scenario is executable in Rondo,
the first complete MM prototype
 [Melnik et al, SIGMOD 2003]
 There are many prototypes for Match
 [Rahm & Bernstein, VLDB J., Dec. 2001]
 Detailed design for Merge
 [Pottinger & Bernstein, VLDB 2003]
 There are several efforts on a formal
semantics for MM operators



Outline
 Introduction to Model Management
 Using MM to solve meta data
problems
 Matching anatomy ontologies
 Model merging
 Wrap-up



Schema Matching Algorithms
 About a dozen published algorithms
 Schema-based vs. content-based
 Per-element vs. structural
 Linguistic vs. constraint-based
 Independently-developed schemas vs.
incrementally-modified schemas
 Hybrid vs. composite
 Many good ideas, but none are robust
 Human review and input is essential
 User interface is also quite important



Matching Anatomy Ontologies
 Match two human anatomy ontologies
 FMA – Univ. of Washington
 Galen CRM – Univ. of Manchester (UK)
 By Peter Mork (Univ. of Washington)
 Both models are big

 Ultimate goal was finding differences


 Like most match algorithms, ours
calculates a similarity score for the
m × n pairs of elements



Aligning Representations
FMA: generic
part

generic Cardiac Cardiac


Heart Heart
part valve valve

CRM:
Heart h-S-C
Heart sensibly
hasStructuralComponent
ValveInHeart
Valve In
sensibly
Heart



Anatomy Matching Algorithm
1. Lexical Match
• Normalize string, UMLS dictionary lookup,
convert to concept-ID from thesaurus

• String comparison → 306 matches


• Adding spaces, ignoring case → 1834 matches
• Lexical tools → 3503 matches
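The three stages of the lexical match can be sketched as follows. This is an illustration only: the `THESAURUS` dictionary and its concept IDs are invented stand-ins for a UMLS lookup, and `normalize`, `concept`, and `lexical_matches` are hypothetical helper names.

```python
import re

# Invented stand-in for a thesaurus lookup; the IDs are not real UMLS concepts.
THESAURUS = {
    "cardiac valve": "C001",
    "valve in heart": "C001",
    "heart": "C002",
}


def normalize(name):
    """Split camelCase into words, lower-case, and collapse whitespace."""
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name)
    return " ".join(spaced.lower().split())


def concept(name):
    """Map a name to a concept ID, falling back to the normalized string."""
    normalized = normalize(name)
    return THESAURUS.get(normalized, normalized)


def lexical_matches(names1, names2, key):
    """Compare all m x n pairs and keep those whose keys agree."""
    return {(a, b) for a in names1 for b in names2 if key(a) == key(b)}


fma = ["Heart", "Cardiac valve"]
crm = ["heart", "ValveInHeart"]

exact = lexical_matches(fma, crm, lambda s: s)   # plain string comparison
normalized = lexical_matches(fma, crm, normalize)  # spaces added, case ignored
by_concept = lexical_matches(fma, crm, concept)    # thesaurus lookup

print(len(exact), len(normalized), len(by_concept))
```

Each stage is more permissive than the one before it, which mirrors the growing match counts reported on the slide.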



Anatomy Matching Example
S: similarity score
generic
S = 2/15
part

Cardiac
Heart
valve

S=1

Heart h-S-C S=1

Valve In
sensibly
Heart
Anatomy Matching Algorithm
1. Lexical Match
• Normalize string, UMLS dictionary lookup,
convert to concept-ID from thesaurus
2. Structure Match
• Similarity(reified nodes)
= Average(neighbors)
• Back-propagate to neighbors

• Adds 64 matches (to previous 3503)


• Implies 875 reified relationship matches
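One way to read "Similarity(reified nodes) = Average(neighbors)" is sketched here. The averaging rule is an assumption: for every neighbor of the node with more neighbors, take its best similarity to any neighbor of the other node, then average. With the Heart example this rule happens to give 2/3, the value shown in the matching example, but the rule actually used may differ. Back-propagation is not shown.

```python
from fractions import Fraction


def neighbor_average(nbrs1, nbrs2, sim):
    """Average, over the larger neighbor list, of each neighbor's best score."""
    def score(a, b):
        # Similarities are stored once per pair, so check both orientations.
        return sim.get((a, b), sim.get((b, a), Fraction(0)))

    if len(nbrs1) >= len(nbrs2):
        larger, smaller = nbrs1, nbrs2
    else:
        larger, smaller = nbrs2, nbrs1
    if not larger:
        return Fraction(0)
    best = [max((score(a, b) for b in smaller), default=Fraction(0))
            for a in larger]
    return sum(best, Fraction(0)) / len(best)


# Lexical scores for the ordinary nodes (S = 1 for both matched pairs).
sim = {
    ("fma:Heart", "crm:Heart"): Fraction(1),
    ("fma:Cardiac valve", "crm:ValveInHeart"): Fraction(1),
}

# Neighbors of the two reified relationship nodes.
generic_part_nbrs = ["fma:Heart", "fma:Cardiac valve"]
hsc_nbrs = ["crm:Heart", "crm:ValveInHeart", "crm:sensibly"]

print(neighbor_average(generic_part_nbrs, hsc_nbrs, sim))
```

The unmatched neighbor `sensibly` contributes 0, so the average is (1 + 1 + 0) / 3.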
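Anatomy Matching Example
[Figure: anatomy matching example after the structure match, S = similarity score. FMA nodes: Heart, generic part, Cardiac valve. CRM nodes: Heart, h-S-C, Valve In Heart, sensibly. Scores shown: S = 2/15 (next to generic part), S = 1, S = 2/3 (next to h-S-C), and S = 1.]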
Anatomy Matching Algorithm
1. Lexical Match
• Normalize string, UMLS dictionary lookup,
convert to concept-ID from thesaurus
2. Structure Match
• Similarity(reified nodes)
= Average(neighbors)
• Back-propagate to neighbors
3. Align Super-classes
• Super-class similarity = average similarity of
children, grandchildren, great-grandchildren
• Adds 213 matches (to 3567)
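A small sketch of the super-class step, assuming each ontology's class hierarchy is a dictionary from a class to its direct subclasses and that existing match scores sit in a `sim` dictionary. How the average is taken is an assumption (best score per descendant of the first class, averaged), and the toy hierarchies are invented for illustration.

```python
def descendants(tree, node, depth=3):
    """Children, grandchildren, and great-grandchildren of node."""
    found, frontier = [], [node]
    for _ in range(depth):
        frontier = [c for n in frontier for c in tree.get(n, [])]
        found.extend(frontier)
    return found


def superclass_similarity(tree1, c1, tree2, c2, sim):
    """Average best-match score of c1's descendants against c2's descendants."""
    d1, d2 = descendants(tree1, c1), descendants(tree2, c2)
    if not d1 or not d2:
        return 0.0
    best = [max(sim.get((a, b), 0.0) for b in d2) for a in d1]
    return sum(best) / len(best)


# Invented toy hierarchies; Leaflet sits four levels down and is ignored.
tree1 = {"Organ": ["Heart", "Lung"], "Heart": ["Cardiac valve"]}
tree2 = {
    "BodyPart": ["heart", "lung"],
    "heart": ["ValveInHeart"],
    "ValveInHeart": ["MitralValve"],
    "MitralValve": ["Leaflet"],
}
sim = {("Heart", "heart"): 1.0, ("Cardiac valve", "ValveInHeart"): 1.0}

score = superclass_similarity(tree1, "Organ", tree2, "BodyPart", sim)
print(round(score, 3))
```

Two of Organ's three descendants have a matched counterpart under BodyPart, so the two super-classes score 2/3 even though their names share nothing.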
Some Lessons
 A common encoding of models is hard
and involves compromises
 Different styles of reifying relationships
 CRM stores transitive relationships
 Match needs to invent generalizations
 In FMA, arterial supply, venous drainage,
nerve supply, lymphatic drainage
 In CRM, these all map to isServedBy
 On big models, Match is expensive
 Some steps required days to execute
 Cross-product filled 80 GB (< 1 GB input)
Outline
 Introduction to Model Management
 Using MM to solve meta data
problems
 Matching anatomy ontologies
 Model merging
 Wrap-up



Merge(M1, M2, map)
 Return the union of models M1 and M2
 Use map to guide the Merge
 If elements x = y in map, then collapse
them into one element


[Figure: M1 is Emp (Addr, Name); M2 is Emp (Name, Phone); map states Name = Name; the merged result is Emp (Addr, Name, Phone)]
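The union-and-collapse behavior can be sketched as below, assuming a model is a dictionary from each element to the set of its children and the input map is a set of (M1 element, M2 element) equalities. The sketch ignores conflicts entirely and assumes unmapped elements of the two models have distinct names. Treating the two Emp roots as equal is also an assumption added here.

```python
def merge(m1, m2, mapping):
    """Union m1 and m2, collapsing each pair (x, y) in mapping into x."""
    m2_to_m1 = {y: x for x, y in mapping}
    map13 = {e: e for e in m1}                      # M1 element -> M3 element
    map23 = {e: m2_to_m1.get(e, e) for e in m2}     # M2 element -> M3 element
    m3 = {}
    for model, to_m3 in ((m1, map13), (m2, map23)):
        for element, children in model.items():
            merged_children = m3.setdefault(to_m3[element], set())
            merged_children.update(to_m3[c] for c in children)
    return m3, map13, map23


m1 = {"Emp": {"Addr", "Name"}, "Addr": set(), "Name": set()}
m2 = {"Emp": {"Name", "Phone"}, "Name": set(), "Phone": set()}
mapping = {("Emp", "Emp"), ("Name", "Name")}

m3, map13, map23 = merge(m1, m2, mapping)
print(sorted(m3["Emp"]))
```

Returning `map13` and `map23` alongside the merged model follows the operator signature <M3, map13, map23> = Merge(M1, M2, map), so later steps can trace each merged element back to its sources.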



Merge(M1, M2, map)
 [Buneman, Davidson, Kosky, EDBT 92]
 Meta-model has aggregation & generalization only
 Union, and collapse objects having the same name
 Fix-up step for inconsistencies created by merging

[Figure: fix-up example with classes X, Y, Z, and W and attribute a, showing an inconsistency created by merging and its resolution]
 Successive fixups lead to different results
 Batch them at the end, to get a unique minimal result
 Now enrich the meta-model (containment, complex
mappings, …) & merge semantics (conflicts, deletes)
Resolving Merge Conflicts
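[Figure: Emp (Emp#, Name) is related by mapee to Employee (EmployeeID, FirstName, LastName); the elements are numbered 1 to 11. The merged result is Emp with Emp# and Name, where Name contains FirstName and LastName. Callouts mark a Meta-Meta-Model Conflict, a Model Conflict, and a Meta-Model Conflict.]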
Contributions to Merge
[Pottinger & Bernstein, VLDB 03]
 Generic correctness criteria for Merge
 Use of first-class input mapping (not just
correspondences)
 Taxonomy of conflicts & resolution strategies
 Characterize when Merge can be automatic
 A merge algorithm for an EER representation
 Experimental evaluation



An Approach to ModelGen
[Atzeni & Torlone, EDBT '96]

 Meta-models are made of patterns


(a) Object has sub-object
(b) Aggregation has attributes
(c) Aggregation has key
 Define pattern transformations as rules
 For XSD→SQL,
+

 To translate Ms into meta-model t (MMt),
 Apply rules that replace patterns in Ms that are
not in MMt by patterns that are in MMt
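The rule-driven translation can be illustrated with a toy rewriter. Everything concrete here is an assumption: a model is a list of pattern instances written as tuples, the target meta-model is the set of pattern names it allows, and the single rule (`sub_object_to_table`, with its generated key and reference attribute names) is invented for illustration rather than taken from the cited approach.

```python
# Patterns assumed to be allowed by the target (SQL-like) meta-model.
SQL_PATTERNS = {"aggregation_has_attribute", "aggregation_has_key"}


def sub_object_to_table(fact):
    """Hypothetical rule: a sub-object becomes its own keyed aggregation
    that carries an attribute referring back to its parent."""
    _, parent, child = fact
    return [
        ("aggregation_has_key", child, f"{child}_id"),
        ("aggregation_has_attribute", child, f"{parent}_id"),
    ]


RULES = {"object_has_sub_object": sub_object_to_table}


def model_gen(facts, target_patterns, rules):
    """Rewrite facts until every pattern belongs to the target meta-model.

    Raises KeyError if a pattern has no rule. A rule set whose outputs
    feed back into each other could loop forever; this sketch does not
    guard against that.
    """
    result, pending = [], list(facts)
    while pending:
        fact = pending.pop(0)
        if fact[0] in target_patterns:
            if fact not in result:
                result.append(fact)
        else:
            pending.extend(rules[fact[0]](fact))
    return result


source = [
    ("object_has_sub_object", "PO", "POLines"),
    ("aggregation_has_attribute", "PO", "POdate"),
]
result = model_gen(source, SQL_PATTERNS, RULES)
for fact in result:
    print(fact)
```

A full ModelGen would also return the mapping between source and generated elements, which this sketch omits.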
ModelGen Research

 More complete repertoire of patterns


 Make patterns more generic
 Integrate with rules engine (avoid cycles,
control search)
 Implement it



What Next?
 Add semantics to mappings
 Thorough formal semantics of operators
 Industrial strength schema matching
 More and bigger applications
 More prototypes
 More operators
 Better user interfaces



References
 http://www.research.microsoft.com/~philbe
 Overview
 Bernstein, CIDR 2003
 Bernstein, Halevy, & Pottinger, SIGMOD Record, Dec. 2000
 Implementation
 Melnik, Rahm, & Bernstein, SIGMOD 2003
 Data Warehouse Examples
 Bernstein & Rahm, ER 2000
 Match Operation
 Survey: Rahm & Bernstein, VLDB J., Dec. 2001
 Prototype: Madhavan, Bernstein, & Rahm, VLDB 2001
 Merge Operation
 Pottinger & Bernstein, VLDB 2003
 Theory
 Alagić & Bernstein, DBPL 2001
 Madhavan et al, AAAI 2002
