CSE 636

Data Integration

Data Integration Approaches


Virtual Integration Architecture

• Leave the data in the sources
• When a query comes in:
– Determine the sources relevant to the query
– Break the query down into sub-queries for the sources
– Get the answers from the sources, filter them if needed, and combine them appropriately
• Data is fresh
• Also known as On-Demand Integration
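
The run-time steps above can be sketched in a few lines of Python. This is a minimal illustration, not a real mediator: the `Source` class, the `answer` function, and the sample movie rows are all hypothetical, and a source is treated as relevant simply when it exports every attribute the query mentions.

```python
from typing import Callable, Dict, List, Sequence

Row = Dict[str, object]


class Source:
    """Hypothetical wrapper around one data source."""

    def __init__(self, name: str, attributes: Sequence[str],
                 fetch: Callable[[], List[Row]]):
        self.name = name
        self.attributes = set(attributes)
        self.fetch = fetch  # called per query, so answers reflect current data


def answer(sources: List[Source], select: Sequence[str],
           where_attr: str, where_value: object) -> List[Row]:
    needed = set(select) | {where_attr}
    # Step 1: keep only the sources that export every attribute the query uses.
    relevant = [s for s in sources if needed <= s.attributes]
    combined: List[Row] = []
    seen = set()
    for source in relevant:
        # Step 2: the sub-query here is simply "give me your rows".
        for row in source.fetch():
            # Step 3: filter at the mediator, project, and union without duplicates.
            if row[where_attr] != where_value:
                continue
            key = tuple(row[a] for a in select)
            if key not in seen:
                seen.add(key)
                combined.append(dict(zip(select, key)))
    return combined


s1 = Source("S1", ["title", "dir", "year"],
            lambda: [{"title": "Annie Hall", "dir": "Allen", "year": 1977},
                     {"title": "Manhattan", "dir": "Allen", "year": 1979}])
s2 = Source("S2", ["title", "dir", "year"],
            lambda: [{"title": "Annie Hall", "dir": "Allen", "year": 1977},
                     {"title": "Star Wars", "dir": "Lucas", "year": 1977}])
s3 = Source("S3", ["cinema", "title"],
            lambda: [{"cinema": "Rex", "title": "Manhattan"}])

movies_1977 = answer([s1, s2, s3], ["title", "dir"], "year", 1977)
print(movies_1977)
```

Here S3 is pruned because it exports no `year`, and the duplicate row reported by both S1 and S2 appears once in the combined result.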

Virtual Integration Architecture

[Figure: architecture split into Design-Time and Run-Time. Components shown: Mapping Tool, Mediation Language, Mediator with Global Schema, End User, Query, Query Reformulation, Optimization & Execution, Result, Web Services, XML, and a Wrapper over each Data Source with its Local Schema. Numbered callouts 1-6 annotate the diagram.]
Virtual Integration Approaches

Dimensions to Consider:
• How many sources are we accessing?
• How autonomous are they?
• What metadata do we have about the sources?
• Is the data structured?
• Queries only, or also updates?
• Requirements: accuracy, completeness, performance, handling of inconsistencies.
• Closed-world assumption vs. open-world assumption?

Mediation Languages

Global Schema:
CD(ASIN, Title, Genre)
Artist(ASIN, Name, …)

Logic (relates the global schema to the source schemas)

Source Schemas:
CDs(Album, ASIN, Price, DiscountPrice, Studio)
Books(Title, ISBN, Price, DiscountPrice, Edition)
Authors(ISBN, FirstName, LastName)
Artists(ASIN, ArtistName, GroupName)
CDCategories(ASIN, Category)
BookCategories(ISBN, Category)

Desiderata from Source Descriptions

• Expressive power: distinguish between sources with closely related data, and hence be able to prune access to irrelevant sources.
• Easy addition: make it easy to add new data sources.
• Reformulation: be able to reformulate a user query into a query on the sources efficiently and effectively.

Reformulation Problem

Given:
• A query Q posed over the global schema
• Descriptions of the data sources
Find:
• A query Q’ over the data source relations, such that:
– Q’ provides only correct answers to Q, and
– Q’ provides all the answers to Q that can be obtained from the sources.

Languages for Schema Mapping

[Figure: a query Q is posed over the Global (Mediated) Schema at the Mediator and reformulated into queries Q’ over five Sources, each with its own Local Schema. The mappings between the global schema and the sources are written in GAV, LAV, or GLAV.]

Global-as-View (GAV)

Global Schema:
Movie(title, dir, year, genre)
Schedule(cinema, title, time)

Integrating View:
Create View Movie AS
SELECT * FROM S1 [S1(title,dir,year,genre)]
union
SELECT * FROM S2 [S2(title,dir,year,genre)]
union
SELECT S3.title, S3.dir, S4.year, S4.genre
FROM S3, S4 [S3(title,dir), S4(title,year,genre)]
WHERE S3.title = S4.title
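
The integrating view above can be tried out directly. The following is a minimal sketch using Python's built-in `sqlite3` module; the four source tables stand in for remote sources, and the sample rows are made up for illustration. Because the global relation `Movie` is defined as a view, a query over `Movie` is answered by unfolding the view into queries over S1 to S4.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Local tables standing in for the four sources (sample data is invented).
    CREATE TABLE S1 (title, dir, year, genre);
    CREATE TABLE S2 (title, dir, year, genre);
    CREATE TABLE S3 (title, dir);
    CREATE TABLE S4 (title, year, genre);

    INSERT INTO S1 VALUES ('Annie Hall', 'Allen', 1977, 'Comedy');
    INSERT INTO S2 VALUES ('Alien', 'Scott', 1979, 'Horror');
    INSERT INTO S3 VALUES ('Vertigo', 'Hitchcock');
    INSERT INTO S4 VALUES ('Vertigo', 1958, 'Thriller');

    -- The GAV mapping: the global relation Movie is a view over the sources.
    CREATE VIEW Movie AS
        SELECT * FROM S1
        UNION
        SELECT * FROM S2
        UNION
        SELECT S3.title, S3.dir, S4.year, S4.genre
        FROM S3, S4
        WHERE S3.title = S4.title;
""")

# A user query over the global schema; the engine unfolds the view for us.
rows = conn.execute(
    "SELECT title, dir FROM Movie WHERE year < 1978 ORDER BY title"
).fetchall()
print(rows)
```

The Vertigo row exists in no single source; it appears in the answer only because the view joins S3 and S4 on title.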

Global-as-View: Example 2

Global Schema:
Movie(title, dir, year, genre)
Schedule(cinema, title, time)

Integrating View:
Create View Movie AS
SELECT title, dir, year, NULL
FROM S1 [S1(title,dir,year)]
union
SELECT title, dir, NULL, genre
FROM S2 [S2(title,dir,genre)]

Global-as-View: Example 3

Global Schema:
Movie(title, dir, year, genre)
Schedule(cinema, title, time)

Integrating Views:
Create View Movie AS
SELECT NULL, NULL, NULL, genre
FROM S4 [S4(cinema, genre)]
Create View Schedule AS
SELECT cinema, NULL, NULL
FROM S4 [S4(cinema, genre)]

But what if we want to find which cinemas are playing comedies?
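
A small experiment suggests why this query is a problem under the GAV mapping above. The sketch below uses Python's `sqlite3` with invented rows for S4. Once S4 is split into the two global views, the title that should connect a cinema to a genre is NULL on both sides, so the join over the global schema returns nothing even though the source itself holds the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Source S4 pairs each cinema with a genre it is playing (invented data).
    CREATE TABLE S4 (cinema, genre);
    INSERT INTO S4 VALUES ('Rex', 'Comedy');
    INSERT INTO S4 VALUES ('Odeon', 'Drama');

    -- The two GAV views from the slide, padded with NULLs.
    CREATE VIEW Movie AS
        SELECT NULL AS title, NULL AS dir, NULL AS year, genre FROM S4;
    CREATE VIEW Schedule AS
        SELECT cinema, NULL AS title, NULL AS time FROM S4;
""")

# The query as a user would write it against the global schema.
via_global = conn.execute("""
    SELECT S.cinema FROM Movie M, Schedule S
    WHERE M.title = S.title AND M.genre = 'Comedy'
""").fetchall()

# The same question asked of the source directly.
direct = conn.execute(
    "SELECT cinema FROM S4 WHERE genre = 'Comedy'"
).fetchall()

print(via_global)  # the join on NULL titles matches nothing
print(direct)
```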

Global-as-View Summary

• Query reformulation boils down to view unfolding.
• Very easy conceptually.
• Can build hierarchies of global schemas.
• You sometimes lose information. Not always natural.
• Adding sources is hard: you need to consider all the other sources that are available.

Local-as-View (LAV)

Global (Mediated) Schema at the Mediator:
Book(ISBN, Title, Genre, Year)
Author(ISBN, Name)

Source Views:
Create View R1 AS [Source 1, books before 1970: R1(ISBN, Title, Name)]
SELECT B.ISBN, B.Title, A.Name
FROM Book B, Author A
WHERE A.ISBN = B.ISBN
AND B.Year < 1970
Create View R5 AS [Source 5, humor books: R5(ISBN, Title)]
SELECT B.ISBN, B.Title
FROM Book B
WHERE B.Genre = 'Humor'

[Figure: Sources 1-5 sit below the mediator, each with its own Local Schema; only R1 (Source 1) and R5 (Source 5) are spelled out.]
Query Reformulation

Query: Find authors of humor books


Plan: R1 Join R5
Query Reformulation

Query: Find authors of humor books before 1960


Plan: Can’t do it! Neither R1 nor R5 exports Year, so the condition "before 1960" cannot be checked.
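
Both reformulation cases can be reproduced with a small sketch in Python's `sqlite3`. Under LAV the mediator only sees the source extents, so R1 and R5 are modelled here as plain tables; the ISBNs, titles, and names are invented placeholders. The first query is answered by joining R1 and R5 on ISBN, which yields authors of humor books that R1 also covers (books before 1970). The second cannot be planned because no source exports a Year column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Source extents. Under LAV these are all the mediator can see;
    -- the global relations Book and Author are never materialized.
    CREATE TABLE R1 (ISBN, Title, Name);   -- books before 1970, with author
    CREATE TABLE R5 (ISBN, Title);         -- humor books
    INSERT INTO R1 VALUES ('111', 'Title A', 'Author A');
    INSERT INTO R1 VALUES ('222', 'Title B', 'Author B');
    INSERT INTO R5 VALUES ('111', 'Title A');
    INSERT INTO R5 VALUES ('333', 'Title C');
""")

# Plan for "find authors of humor books": R1 Join R5 on ISBN.
plan = "SELECT DISTINCT R1.Name FROM R1 JOIN R5 ON R1.ISBN = R5.ISBN"
authors = conn.execute(plan).fetchall()
print(authors)

# "Before 1960" cannot be expressed: collect the columns each source exports.
r1_columns = [row[1] for row in conn.execute("PRAGMA table_info(R1)")]
r5_columns = [row[1] for row in conn.execute("PRAGMA table_info(R5)")]
print("Year" in r1_columns + r5_columns)
```

Humor book 333 is in R5 but not in R1, so its author is out of reach with these two sources; the plan returns the answers the sources can support, not necessarily every answer in the real world.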
Local-as-View: Example 1

Global Schema:
Movie(title, dir, year, genre)
Schedule(cinema, title, time)

Source Views:
Create Source S1 AS [S1(title, dir, year, genre)]
SELECT * FROM Movie
Create Source S3 AS [S3(title, dir)]
SELECT title, dir FROM Movie
Create Source S5 AS [S5(title, dir, year)]
SELECT title, dir, year
FROM Movie
WHERE year > 1960 AND genre='Comedy'

Local-as-View: Example 2

Global Schema:
Movie(title, dir, year, genre)
Schedule(cinema, title, time)

Source Views:
Create Source S4 AS [S4(cinema, genre)]
SELECT cinema, genre
FROM Movie M, Schedule S
WHERE M.title=S.title

Now if we want to find which cinemas are playing comedies, there is hope!
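
One way to see why there is hope: because S4 is described as a join of Movie and Schedule, the question can be rewritten as a selection on S4 alone. The sketch below checks this on invented data with Python's `sqlite3`. It assumes, for the sake of the check, that S4 contains exactly the tuples of its view definition; the global tables exist here only so the two answers can be compared.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical global data, used here only to check the rewriting.
    CREATE TABLE Movie (title, dir, year, genre);
    CREATE TABLE Schedule (cinema, title, time);
    INSERT INTO Movie VALUES ('Annie Hall', 'Allen', 1977, 'Comedy');
    INSERT INTO Movie VALUES ('Alien', 'Scott', 1979, 'Horror');
    INSERT INTO Schedule VALUES ('Rex', 'Annie Hall', '19:00');
    INSERT INTO Schedule VALUES ('Odeon', 'Alien', '21:00');

    -- The LAV description of source S4, taken from the slide.
    CREATE VIEW S4 AS
        SELECT S.cinema AS cinema, M.genre AS genre
        FROM Movie M, Schedule S
        WHERE M.title = S.title;
""")

# The user's query over the global schema.
global_query = """
    SELECT DISTINCT S.cinema FROM Movie M, Schedule S
    WHERE M.title = S.title AND M.genre = 'Comedy'
"""
# A candidate rewriting that touches only the source.
rewriting = "SELECT DISTINCT cinema FROM S4 WHERE genre = 'Comedy'"

expected = conn.execute(global_query).fetchall()
rewritten = conn.execute(rewriting).fetchall()
print(expected, rewritten)
```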

Local-as-View Summary

• Very flexible. You have the power of the entire query language to define the contents of the source.
• Hence, can easily distinguish between contents of closely related sources.
• Adding sources is easy: they’re independent of each other.
• Query reformulation: answering queries using views!

The General Problem

• Given a set of views V1,…,Vn, and a query Q, can we answer Q using only the answers to V1,…,Vn?
• Many, many papers on this problem
• The best performing algorithm: The MiniCon Algorithm (Pottinger & Halevy, VLDB 2000)

Local Completeness Information

• If sources are incomplete, we need to look at each one of them.
• Often, sources are locally complete.
• Example: Movie(title, director, year) is complete for years after 1960, or for American directors.
• Question: given a set of local completeness statements, does a query Q’ return a complete answer to Q?

Example

• Movie(title, director, year)
– complete after 1960
• Show(title, theater, city, hour)
• Query: find movies (and directors) playing in Seattle:
SELECT M.title, M.director
FROM Movie M, Show S
WHERE M.title=S.title
AND city='Seattle'
• Complete or not?
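
A counterexample suggests the answer here is "not complete". The sketch below, in Python's `sqlite3`, uses invented rows and assumes Show is complete. `TrueMovie` plays the role of the real world and `SourceMovie` is what the source holds: everything after 1960, but possibly missing older films. A pre-1960 movie playing in Seattle is then lost from the answer.

```python
import sqlite3

QUERY = """
    SELECT DISTINCT M.title, M.director
    FROM {movie} M, Show S
    WHERE M.title = S.title AND S.city = 'Seattle'
    ORDER BY M.title
"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TrueMovie (title, director, year);     -- the real world
    CREATE TABLE SourceMovie (title, director, year);   -- what the source holds
    CREATE TABLE Show (title, theater, city, hour);     -- assumed complete

    INSERT INTO TrueMovie VALUES ('Vertigo', 'Hitchcock', 1958);
    INSERT INTO TrueMovie VALUES ('Annie Hall', 'Allen', 1977);
    -- The source is complete after 1960 but happens to miss the 1958 film.
    INSERT INTO SourceMovie VALUES ('Annie Hall', 'Allen', 1977);

    INSERT INTO Show VALUES ('Vertigo', 'Theater A', 'Seattle', '19:00');
    INSERT INTO Show VALUES ('Annie Hall', 'Theater B', 'Seattle', '21:00');
""")

true_answer = conn.execute(QUERY.format(movie="TrueMovie")).fetchall()
source_answer = conn.execute(QUERY.format(movie="SourceMovie")).fetchall()
print(true_answer)
print(source_answer)
```

The query places no condition on year, so nothing confines it to the part of Movie that is known to be complete.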

Example #2

• Movie(title, director, year), Oscar(title, year)
• Query: find directors whose movies won Oscars after 1965:
SELECT M.director
FROM Movie M, Oscar O
WHERE M.title=O.title
AND M.year=O.year
AND O.year > 1965
• Complete or not?
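
Here the answer appears to be "complete", under two assumptions not restated on this slide: Movie is complete after 1960, as in the previous example, and Oscar is complete. Any Movie tuple that contributes to the answer must satisfy M.year = O.year > 1965, so it lies in the portion of Movie that is guaranteed to be present. The sketch below, in Python's `sqlite3` with invented rows, illustrates this on one instance; it is an illustration of the argument, not a proof.

```python
import sqlite3

QUERY = """
    SELECT DISTINCT M.director
    FROM {movie} M, Oscar O
    WHERE M.title = O.title AND M.year = O.year AND O.year > 1965
    ORDER BY M.director
"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TrueMovie (title, director, year);     -- the real world
    CREATE TABLE SourceMovie (title, director, year);   -- complete after 1960
    CREATE TABLE Oscar (title, year);                   -- assumed complete

    INSERT INTO TrueMovie VALUES ('Film A', 'Director A', 1955);
    INSERT INTO TrueMovie VALUES ('Film B', 'Director B', 1970);
    -- The source misses the 1955 film, which local completeness allows.
    INSERT INTO SourceMovie VALUES ('Film B', 'Director B', 1970);

    INSERT INTO Oscar VALUES ('Film A', 1955);
    INSERT INTO Oscar VALUES ('Film B', 1970);
""")

true_answer = conn.execute(QUERY.format(movie="TrueMovie")).fetchall()
source_answer = conn.execute(QUERY.format(movie="SourceMovie")).fetchall()
print(true_answer, source_answer)
```

The missing 1955 film does not change the result, because the year condition already excludes it.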

