
Semantic Mapping in Data Integration Systems

Baladevi C

Dept. of Computer Science & Engineering


Amrita School of Engineering
baladevic@gmail.com
Amrita Vishwa Vidyapeetham
India

28 Feb. 2017

Baladevi C 2017
OUTLINE

1 Introduction to Data Integration Systems


2 Background
3 Existing Data Integration Methods
4 Algorithms
5 Conclusion

Top IT Spending Priorities

Real World Applications
Business, the Web, Science, Government
Web: hundreds of millions of high-quality tables on the web.
Pretty much everywhere

Integration in data management: Evolution

Centralized system with three-tier architecture
"Implicit" integration: integration supported by the Database
Management System (DBMS), i.e., the data manager

Integration in data management: Evolution

Centralized system with three-tier architecture and multiple stores
Application-hidden integration: integration "embedded" within the
application

Problems in integrating DBs

Many different types of heterogeneity exist among the DBs to be
used together:
Different platforms: technological heterogeneity
Different query languages: language heterogeneity
Different data schemas: schema (semantic) heterogeneity
Errors in data that result in different values for the same
information: instance heterogeneity

Integration in data management: Evolution

Centralized system with four-tier architecture and multiple,
distributed stores
(Centralized) data integration: the global schema is mapped
to the different data sources, which are heterogeneous,
distributed, and autonomous
Data Integration[5]

What is Data Integration?
Data integration is the problem of providing a unified and
transparent view of a collection of data stored in multiple,
autonomous, and heterogeneous data sources.
What is the importance of data integration?
Data integration becomes increasingly important when
merging the systems of two companies
consolidating applications within one company to provide a
unified view of the company's data assets.

Data Integration System

Examples of Heterogeneity

Title              Author   Year
DBMS               Alice    ’70
Computer Networks  Bob      ’75
Source 1: Details of books published before 1980

Name               BAuthor  PYear
DBMS               Alice    1970
Computer Networks  Bob      1975
Source 2: Details of books published before 1980
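
The two sources above hold the same facts under different attribute names and year formats. A minimal Python sketch of reconciling them into one global form is given here. The global attribute names (`title`, `author`, `year`), the `ATTRIBUTE_MAP` table, and the rule that a two-digit year means 19xx are assumptions made for illustration; the rule is plausible only because both sources list books published before 1980.

```python
# Hypothetical in-memory copies of the two sources shown above.
source1 = [
    {"Title": "DBMS", "Author": "Alice", "Year": "'70"},
    {"Title": "Computer Networks", "Author": "Bob", "Year": "'75"},
]
source2 = [
    {"Name": "DBMS", "BAuthor": "Alice", "PYear": 1970},
    {"Name": "Computer Networks", "BAuthor": "Bob", "PYear": 1975},
]

# Assumed attribute correspondences: source attribute -> global attribute.
ATTRIBUTE_MAP = {
    "source1": {"Title": "title", "Author": "author", "Year": "year"},
    "source2": {"Name": "title", "BAuthor": "author", "PYear": "year"},
}


def normalize_year(value):
    """Turn "'70" (or a curly-apostrophe variant) or 1970 into the integer 1970."""
    if isinstance(value, str):
        digits = value.lstrip("'\u2019")
        # Assumption: a two-digit year belongs to the 1900s.
        return 1900 + int(digits) if len(digits) == 2 else int(digits)
    return int(value)


def to_global(source_name, rows):
    """Rename attributes and normalize values into the assumed global form."""
    mapping = ATTRIBUTE_MAP[source_name]
    result = []
    for row in rows:
        record = {mapping[key]: value for key, value in row.items()}
        record["year"] = normalize_year(record["year"])
        result.append(record)
    return result


global_view_1 = to_global("source1", source1)
global_view_2 = to_global("source2", source2)
```

After the mapping, both sources yield identical records, which is what lets a mediator treat them as one relation.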

Formal framework for data integration
Definition


A data integration system I is a triple ⟨G, S, M⟩, where
G is the global (mediated) schema
S is the source schema
M is the mapping between S and G

Data integration Architecture[2]


Data integration Architecture

A simple Example

A simple Example

Mediated Schema
Movie: title, director, year, genre
Actors: title, actor
Plays: movie, location, startTime
Reviews: title, rating, description

select title, startTime
from Movie, Plays
where Movie.title = Plays.movie AND
      location = 'New York' AND
      director = 'Bob';
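
To make the query concrete, the sketch below runs it with Python's built-in `sqlite3` module. In a real data integration system the mediated relations are virtual; here `Movie` and `Plays` are materialized as ordinary tables only so the query can execute. The sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Movie(title TEXT, director TEXT, year INTEGER, genre TEXT);
    CREATE TABLE Plays(movie TEXT, location TEXT, startTime TEXT);

    -- Invented sample data, not taken from the slides.
    INSERT INTO Movie VALUES ('Film X', 'Bob',   2015, 'drama');
    INSERT INTO Movie VALUES ('Film Y', 'Carol', 2016, 'comedy');
    INSERT INTO Plays VALUES ('Film X', 'New York', '19:30');
    INSERT INTO Plays VALUES ('Film X', 'Boston',   '20:00');
    INSERT INTO Plays VALUES ('Film Y', 'New York', '18:00');
    """
)

# The query from the slide, with string literals quoted.
rows = conn.execute(
    """
    SELECT title, startTime
    FROM Movie, Plays
    WHERE Movie.title = Plays.movie
      AND location = 'New York'
      AND director = 'Bob'
    """
).fetchall()
```

Only the New York showing of the film directed by Bob satisfies all three conditions.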

Challenges in DIS

1 Design of the mediated schema
    Data sources might have different schemas and might export
    data in different formats.
2 Translation of queries over the mediated schema into queries
  over the source schemas
3 Query optimization
    No or limited statistics about the data sources
4 Incomplete data sources
    Data at any source might be partial, overlap with others, or
    even conflict.
    Do we query all the data sources? Or just a few? How many?
    In what order?
5 ...

Mapping[2]

A logical view definition provides the mapping between the global
mediated schema and the local schemas.
Two basic approaches:
GAV (Global As View)
LAV (Local As View)
Mapping means understanding which real data (in the data sources)
correspond to the virtual data (in the mediated schema).

Global As View

The mediated schema is defined as a view over the local schemas.

Source Schema
DB1(id, title, actor, year)
DB2(id, title, actor, year)
DB3(id, review)

Mediated Schema
MovieActor(title, actor)
MovieReview(title, review)

View that provides the mapping between the global schema and the sources
create view MovieActor as
select title, actor from S1.DB1
union
select title, actor from S2.DB2;

Adding new sources is difficult: all existing view definitions might be
affected.
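
The GAV view can be tried out with Python's built-in `sqlite3` module, as in the sketch below. The two sources are modeled as plain tables `DB1` and `DB2` in one in-memory database (the `S1.`/`S2.` prefixes are dropped because no separate databases are attached), and the rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE DB1(id INTEGER, title TEXT, actor TEXT, year INTEGER);
    CREATE TABLE DB2(id INTEGER, title TEXT, actor TEXT, year INTEGER);

    -- Invented sample data; 'Movie A' appears in both sources.
    INSERT INTO DB1 VALUES (1, 'Movie A', 'Bob',   2001);
    INSERT INTO DB2 VALUES (7, 'Movie B', 'Alice', 2005);
    INSERT INTO DB2 VALUES (8, 'Movie A', 'Bob',   2001);

    -- GAV: the mediated relation is a view over the sources.
    CREATE VIEW MovieActor AS
        SELECT title, actor FROM DB1
        UNION
        SELECT title, actor FROM DB2;
    """
)

movie_actor = sorted(
    conn.execute("SELECT title, actor FROM MovieActor").fetchall()
)
```

Because the view uses `UNION` rather than `UNION ALL`, the tuple present in both sources appears once in `MovieActor`. Adding a third source would mean editing this view definition, which is the maintenance cost of GAV.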

Local as View

Each local schema is defined as a view over the mediated schema.

Source Schema
DB1(id, title, actor, year)
DB2(id, title, actor, year)
DB3(id, review)

Mediated Schema
MovieActor(title, actor)
MovieReview(title, review)

View that provides the mapping between the global schema and the sources
create view S1.DB1 as
select * from MovieActor;
create view S2.DB2 as
select * from MovieActor;
create view S3.DB3 as
select * from MovieReview;

Query reformulation is harder in LAV.

Query Answering/Rewriting in GAV

Query
Find reviews for movies starring Bob
Query over Mediated Schema
q(title, review) :- MovieActor(title, 'Bob'), MovieReview(title, review).
Reformulated Query
q(title, review) :- DB1(id, title, 'Bob', year), DB3(id, review).
q(title, review) :- DB2(id, title, 'Bob', year), DB3(id, review).
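
In GAV, reformulation amounts to unfolding each mediated relation into its view definition. The Python sketch below evaluates the resulting union of conjunctive queries over small in-memory sources. It assumes that `MovieActor` is the union of `DB1` and `DB2`, as in the GAV view definition, and that reviews in `DB3` join to movies on `id`. The sample tuples are invented for illustration.

```python
# Invented sample contents of the three sources.
DB1 = [(1, "Movie A", "Bob", 2001), (2, "Movie B", "Alice", 2003)]
DB2 = [(3, "Movie C", "Bob", 2005)]
DB3 = [(1, "Great"), (2, "Good"), (3, "Average")]


def rewriting(source, actor_name):
    """Evaluate q(title, review) :- source(id, title, actor_name, year), DB3(id, review)."""
    return {
        (title, review)
        for (movie_id, title, actor, _year) in source
        for (review_id, review) in DB3
        if actor == actor_name and movie_id == review_id
    }


# Unfolding MovieActor = DB1 UNION DB2 gives one rewriting per source;
# the answer is the union of both results.
answers = rewriting(DB1, "Bob") | rewriting(DB2, "Bob")
```

Each source contributes the Bob movies it knows about, and the set union removes any duplicates across sources.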

Bucket Algorithm[2]

The goal of the bucket algorithm is to reformulate a user
query posed on a mediated (virtual) schema into a
query that refers directly to the available data sources.
The bucket algorithm returns the maximally contained
rewriting of the query using the views.
Bucket Algorithm
1 The algorithm creates a bucket for each subgoal in Q, i.e., the
  bucket contains the views (data sources) that are relevant to
  answering that particular subgoal.
2 In the second step, the algorithm combines views from the buckets
  to produce a maximally contained rewriting of the query using the
  views, rather than an equivalent rewriting.

An Example for Bucket algorithm[2]

Mediated Schema:
Enrolled(student, dept)
Registered(student, course, year)
Course(course, number)
View of Data Sources:
V1(student, number, year) :- Registered(student, course, year),
    Course(course, number), number ≥ 500, year ≥ 1992.
V2(student, dept, course) :- Registered(student, course, year),
    Enrolled(student, dept).
V3(student, course) :- Registered(student, course, year),
    year ≤ 1990.
V4(student, course, number) :- Registered(student, course, year),
    Course(course, number), Enrolled(student, dept), number ≤ 100.

An Example for Bucket algorithm[2]
S → student, D → dept, Y → year, C → course, N → number

Query is:
q(S, D) :- Enrolled(S, D), Registered(S, C, Y), Course(C, N),
    N ≥ 300, Y ≥ 1995.
Bucket Formed:
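
A minimal Python sketch of the bucket-creation step for this query and these views follows. It is a simplification, not a full implementation. A view enters the bucket of a subgoal when (a) it has a body atom with the same predicate, (b) every head variable of the query in that subgoal maps to a head variable of the view, and (c) the numeric bounds on the mapped variables are compatible. The dictionary layout and helper names are assumptions made for illustration, and the result is what this sketch computes, not a copy of the slide's bucket table.

```python
import math

ALL = (-math.inf, math.inf)  # unconstrained closed interval

VIEWS = {
    "V1": {"head": ("student", "number", "year"),
           "body": [("Registered", ("student", "course", "year")),
                    ("Course", ("course", "number"))],
           "bounds": {"number": (500, math.inf), "year": (1992, math.inf)}},
    "V2": {"head": ("student", "dept", "course"),
           "body": [("Registered", ("student", "course", "year")),
                    ("Enrolled", ("student", "dept"))],
           "bounds": {}},
    "V3": {"head": ("student", "course"),
           "body": [("Registered", ("student", "course", "year"))],
           "bounds": {"year": (-math.inf, 1990)}},
    "V4": {"head": ("student", "course", "number"),
           "body": [("Registered", ("student", "course", "year")),
                    ("Course", ("course", "number")),
                    ("Enrolled", ("student", "dept"))],
           "bounds": {"number": (-math.inf, 100)}},
}

QUERY = {"head": ("S", "D"),
         "body": [("Enrolled", ("S", "D")),
                  ("Registered", ("S", "C", "Y")),
                  ("Course", ("C", "N"))],
         "bounds": {"N": (300, math.inf), "Y": (1995, math.inf)}}


def intervals_overlap(a, b):
    """True if two closed intervals share at least one point."""
    return max(a[0], b[0]) <= min(a[1], b[1])


def build_buckets(query, views):
    """Return {subgoal predicate: [names of views relevant to it]}."""
    buckets = {}
    for pred, q_args in query["body"]:
        bucket = []
        for name, view in views.items():
            for v_pred, v_args in view["body"]:
                if v_pred != pred:
                    continue
                mapping = dict(zip(q_args, v_args))
                # (b) query head variables must be exported by the view.
                if any(q in query["head"] and v not in view["head"]
                       for q, v in mapping.items()):
                    continue
                # (c) bounds on mapped variables must be jointly satisfiable.
                if all(intervals_overlap(query["bounds"].get(q, ALL),
                                         view["bounds"].get(v, ALL))
                       for q, v in mapping.items()):
                    bucket.append(name)
                    break
        buckets[pred] = bucket
    return buckets


buckets = build_buckets(QUERY, VIEWS)
```

Under these rules V3 is excluded from the Registered bucket because year ≤ 1990 conflicts with Y ≥ 1995, and V4 is excluded from the Course bucket because number ≤ 100 conflicts with N ≥ 300. The second step of the algorithm, which combines one view per bucket and checks containment, is not shown.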

QR Decomposition[4]

QR decomposition is an efficient method for finding the independent
attributes that can represent the data in its proper substructure
form without losing its semantics.
It decomposes a matrix A into a product A = QR of an orthogonal
matrix Q and an upper triangular matrix R.
It is essential to remove the redundant data and bring
only the significant data to the forefront.

A Simple Example

The house location attribute may or may not determine the number of
living rooms.
This can be established by checking whether one vector (house location)
is orthogonal to the other vector (living rooms).
QR decomposition is a technique to establish this fact.
QR decomposition with column pivoting is used to distinguish between the
independent and dependent attributes.
The next objective is to provide an integrated view of the
heterogeneous data sources with the help of a knowledge base.
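
The idea of separating independent from dependent attributes can be illustrated with a small pure-Python sketch. It uses Gram-Schmidt orthogonalization with column pivoting as a stand-in for a library QR routine; it is not the method of [4]. The attribute names, the sample values, and the absolute tolerance `tol` are assumptions made for illustration.

```python
import math


def pivoted_gram_schmidt(columns, tol=1e-9):
    """Split column indices into (independent, dependent).

    At each step the remaining column with the largest residual norm is
    chosen as the pivot, normalized, and projected out of the others.
    Columns whose residual norm falls to (near) zero are dependent.
    """
    residuals = [[float(x) for x in col] for col in columns]
    remaining = list(range(len(columns)))
    independent = []
    while remaining:
        norms = {j: math.sqrt(sum(x * x for x in residuals[j]))
                 for j in remaining}
        pivot = max(remaining, key=lambda j: norms[j])
        if norms[pivot] <= tol:
            break  # everything left is (numerically) dependent
        independent.append(pivot)
        remaining.remove(pivot)
        q = [x / norms[pivot] for x in residuals[pivot]]
        for j in remaining:
            dot = sum(a * b for a, b in zip(q, residuals[j]))
            residuals[j] = [a - dot * b for a, b in zip(residuals[j], q)]
    return independent, sorted(remaining)


# Hypothetical attribute columns for four houses.
attributes = ["area_units_a", "area_units_b", "living_rooms"]
columns = [
    [1, 2, 3, 4],   # area_units_a
    [2, 4, 6, 8],   # area_units_b = 2 * area_units_a (redundant)
    [1, 0, 1, 0],   # living_rooms
]

independent, dependent = pivoted_gram_schmidt(columns)
```

Pivoting picks the column with the largest norm first, so of the two redundant area columns the scaled one is kept and the other is reported as dependent.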

Frequency based Coverage Statistics Mining[3]

StatMiner, a statistics mining module, estimates the coverage and
overlap statistics.
Ranking of Sources
Ranks all sources in descending order of P(S|Q).
P(S|Q) is the coverage of source S with respect to a given query Q.
Queries are grouped into query classes using the attributes and
corresponding values of the classificatory attributes.
The query list keeps the frequency of each class.

Frequency based Coverage Statistics Mining[3]

Probability of a query Q being posed to the mediator:
$P(Q) = FR_Q / FR$, where
$FR_Q$ is the access frequency of the query Q,
$FR$ is the total frequency of all queries in the query list.
Probability that a random query posed to the mediator is subsumed
by the class C:
$P_{map}(C) = \sum_{Q \text{ is mapped to } C} P(Q)$
Probability that a random query belonging to the class C is present
in a set of sources $S'$:
$P(S' \mid C) = \dfrac{\sum_{Q \in C} P(S' \mid Q) \, P(Q)}{P(C)}$
This is the overlap statistics w.r.t. a query class C.
If the query overlaps multiple sources, then the class-source
association rule, $C \to S'$, gives the rank of the sources.
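
The three formulas can be checked on a toy query list with the Python sketch below. The query frequencies, the class assignments, and the per-query coverage values P(S|Q) are invented for illustration. The sketch also takes P(C) in the denominator to be the same quantity as P_map(C) and treats each source as a one-element source set; both are assumptions.

```python
from fractions import Fraction

# Invented query list: access frequency of each query.
query_frequency = {"q1": 6, "q2": 3, "q3": 1}
# Invented mapping of queries to query classes.
query_class = {"q1": "C1", "q2": "C1", "q3": "C2"}
# Invented coverage P(S|Q) of each source for each query.
coverage = {
    "S1": {"q1": Fraction(1, 2), "q2": Fraction(1), "q3": Fraction(0)},
    "S2": {"q1": Fraction(1, 4), "q2": Fraction(1, 2), "q3": Fraction(1)},
}

total_frequency = sum(query_frequency.values())
# P(Q) = FR_Q / FR
p_query = {q: Fraction(f, total_frequency) for q, f in query_frequency.items()}


def p_class(c):
    """P_map(C): sum of P(Q) over the queries mapped to class C."""
    return sum((p_query[q] for q in p_query if query_class[q] == c),
               Fraction(0))


def p_source_given_class(source, c):
    """P(S|C) = sum over Q in C of P(S|Q) * P(Q), divided by P(C)."""
    numerator = sum(
        (coverage[source][q] * p_query[q]
         for q in p_query if query_class[q] == c),
        Fraction(0),
    )
    return numerator / p_class(c)


def rank_sources(c):
    """Sources in descending order of P(S|C)."""
    return sorted(coverage, key=lambda s: p_source_given_class(s, c),
                  reverse=True)
```

For class C1, S1 covers two thirds of the expected answers and S2 one third, so S1 is ranked first; for class C2 the order is reversed.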

Problem Definition

Our objective is to form an approximate view of all the data sources
at the global level, in order to reduce the storage requirement at
the global level and to retrieve data efficiently.

A first Approximation

Same data model
Adoption of a global schema
The global schema will provide a reconciled, integrated, virtual
view of the data sources

References

[1] Xin Dong, Alon Y. Halevy, and Cong Yu. Data integration with
uncertainty. In Proceedings of the 33rd International Conference on Very
Large Data Bases, VLDB '07, pages 687–698. VLDB Endowment, 2007.

[2] Alon Y. Levy. Logic-based techniques in data integration. In
Logic-Based Artificial Intelligence, pages 575–595. Kluwer Academic
Publishers, Norwell, MA, USA, 2000.

[3] Zaiqing Nie and Subbarao Kambhampati. A frequency-based approach
for mining coverage statistics in data integration. In Proceedings of the
20th International Conference on Data Engineering, ICDE '04, pages 387,
Washington, DC, USA, 2004. IEEE Computer Society.

[4] Harikumar Sandhya and Mekha Meriam Roy. Data Integration of
Heterogeneous Data Sources Using QR Decomposition. Springer
International Publishing Switzerland, 2016.

[5] AnHai Doan, Alon Halevy, and Zachary Ives. Principles of Data
Integration.
