Semantic Mapping in Data Integration Systems: Baladevi C

Semantic Mapping in Data Integration Systems
Baladevi C
Dept. of Computer Science & Engineering
Amrita School of Engineering
Amrita Vishwa Vidyapeetham, India
baladevic@gmail.com
28 Feb. 2017
Baladevi C 2017
OUTLINE
1 Introduction to Data Integration Systems
2 Background
3 Existing Data Integration Methods
4 Algorithms
5 Conclusion

Top IT Spending Priorities
Real World Applications
Business
Science
Government
The Web: hundreds of millions of high-quality tables on the web.
Pretty much everywhere.

Integration in data management: Evolution
Centralized system with three-tier architecture
"Implicit" integration: integration supported by the Database Management System (DBMS), i.e., the data manager

Centralized system with three-tier architecture and multiple stores
Application-hidden integration: integration "embedded" within the application

Problems in integrating DBs
Many different types of heterogeneity arise among the DBs that are to be used together:
Different platforms: technological heterogeneity
Different query languages: language heterogeneity
Different data schemas: schema (semantic) heterogeneity
Errors in data that result in different values for the same information: instance heterogeneity

Centralized system with four-tier architecture and multiple, distributed stores
(Centralized) data integration: the global schema is mapped to the different data sources, which are heterogeneous, distributed and autonomous

Data Integration[5]
What is Data Integration?
Data integration is the problem of providing a unified and transparent view of a collection of data stored in multiple, autonomous, and heterogeneous data sources.
What is the importance of data integration?
Data integration becomes increasingly important when
merging the systems of two companies
consolidating applications within one company to provide a unified view of the company's data assets.

Data Integration System
Examples of Heterogeneity
Title              Author  Year
DBMS               Alice   '70
Computer Networks  Bob     '75
Source 1: Details of books published before 1980
Name               BAuthor  PYear
DBMS               Alice    1970
Computer Networks  Bob      1975
Source 2: Details of books published before 1980

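The two sources above describe the same books under different attribute names and year formats. A minimal Python sketch of reconciling them follows. The common attribute names (title, author, year), the two mapping dictionaries, and the rule that a two-digit year belongs to the 1900s are assumptions made for illustration; they are not taken from the slides.

```python
# Hypothetical in-memory copies of the two sources shown above.
SOURCE_1 = [
    {"Title": "DBMS", "Author": "Alice", "Year": "'70"},
    {"Title": "Computer Networks", "Author": "Bob", "Year": "'75"},
]
SOURCE_2 = [
    {"Name": "DBMS", "BAuthor": "Alice", "PYear": 1970},
    {"Name": "Computer Networks", "BAuthor": "Bob", "PYear": 1975},
]

# Assumed attribute correspondences: source attribute -> common attribute.
MAPPING_1 = {"Title": "title", "Author": "author", "Year": "year"}
MAPPING_2 = {"Name": "title", "BAuthor": "author", "PYear": "year"}


def normalize_year(value):
    """Turn "'70" or 1970 into 1970 (assumes two-digit years mean 19xx)."""
    year = int(str(value).lstrip("'\u2019"))
    return 1900 + year if year < 100 else year


def to_common(row, mapping):
    """Rename the attributes of one source row and normalize its year."""
    renamed = {mapping[name]: value for name, value in row.items()}
    return (renamed["title"], renamed["author"], normalize_year(renamed["year"]))


# Once both sources agree on names and formats, a set removes the duplicates.
integrated = sorted(
    {to_common(row, MAPPING_1) for row in SOURCE_1}
    | {to_common(row, MAPPING_2) for row in SOURCE_2}
)
print(integrated)
```

Both sources collapse to the same two records, which is the unified view that a mediated schema is meant to provide.
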
Formal framework for data integration
Definition
A data integration system I is a triple ⟨G, S, M⟩, where
G is the global (mediated) schema
S is the source schema
M is the mapping between S and G

Data integration Architecture[2]
A simple Example
Mediated Schema
Movie: title, director, year, genre
Actors: title, actor
Plays: movie, location, startTime
Reviews: title, rating, description
select title, startTime
from Movie, Plays
where Movie.title = Plays.movie AND
location = 'New York' AND
director = 'Bob'

Challenges in DIS
1 Design of the mediated schema
  Data sources might have different schemas and might export data in different formats.
2 Translation of queries over the mediated schema into queries over the source schemas
3 Query optimization
  No or limited statistics about the data sources
4 Incomplete data sources
  Data at any source might be partial, overlap with other sources, or even conflict with them.
  Do we query all the data sources? Or just a few? How many? In what order?
5 ...

Mapping[2]
A logical view definition provides the mapping between the global mediated schema and the local schemas.
Two basic approaches
GAV (Global As View)
LAV (Local As View)
Mapping means understanding which real data (in the data sources) correspond to the virtual data (in the mediated schema).

Global As View
Mediated schema as a view over the local schema.
Source Schema
DB1(id, title, actor, year)
DB3(id, review)
Mediated Schema
MovieActor(title, actor)
MovieReview(title, review)
View that provides the mapping between the global schema and the sources
create view MovieActor as
select title, actor from S1.DB1
union
select title, actor from S2.DB2;
Adding new sources is difficult: all existing view definitions might be affected.

Local as View
Local schema as a view over the mediated schema.
Source Schema
DB3(id, review)
Mediated Schema
MovieActor(title, actor)
MovieReview(title, review)
View that provides the mapping between the global schema and the sources
create view S1.DB1 as
select * from MovieActor
select * from MovieActor
select * from MovieReview
Query reformulation is harder in LAV.
Query Answering/Rewriting in GAV
Query
Find reviews for movies starring Bob
Query over Mediated Schema
q(title, review) :- MovieActor(title, 'Bob'), MovieReview(title, review).
Reformulated Query
q(title, review) :- DB1(id, title, 'Bob', year), DB3(id, review)
q(title, review) :- DB1(id, title, 'Bob', year), DB2(id, 'Bob', year), DB3(id, review)

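In GAV the mediated relations are views over the sources, so answering a query amounts to unfolding those views. The sketch below checks this on toy data with Python's built-in sqlite3 module. The rows, the definition of MovieReview as a join of DB1 and DB3 on id, and the omission of DB2 (whose schema the slides do not give) are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Source relations, with made-up rows.
    CREATE TABLE DB1(id INTEGER, title TEXT, actor TEXT, year INTEGER);
    CREATE TABLE DB3(id INTEGER, review TEXT);
    INSERT INTO DB1 VALUES (1, 'Movie A', 'Bob', 1999);
    INSERT INTO DB1 VALUES (2, 'Movie B', 'Alice', 2001);
    INSERT INTO DB3 VALUES (1, 'Great');
    INSERT INTO DB3 VALUES (2, 'Average');

    -- GAV: each mediated relation is defined as a view over the sources.
    CREATE VIEW MovieActor AS
        SELECT title, actor FROM DB1;
    CREATE VIEW MovieReview AS
        SELECT DB1.title AS title, DB3.review AS review
        FROM DB1 JOIN DB3 ON DB1.id = DB3.id;
""")

# Query posed over the mediated schema: reviews for movies starring Bob.
mediated = conn.execute("""
    SELECT ma.title, mr.review
    FROM MovieActor AS ma JOIN MovieReview AS mr ON ma.title = mr.title
    WHERE ma.actor = 'Bob'
""").fetchall()

# Reformulated query that refers to the sources directly.
reformulated = conn.execute("""
    SELECT DB1.title, DB3.review
    FROM DB1 JOIN DB3 ON DB1.id = DB3.id
    WHERE DB1.actor = 'Bob'
""").fetchall()

print(mediated, reformulated)
conn.close()
```

On this data both queries return the same single row. The database engine performs the view unfolding itself, which is why query answering is comparatively simple under GAV.
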
Bucket Algorithm[2]
The goal of the bucket algorithm is to reformulate a user query posed on a mediated (virtual) schema into a query that refers directly to the available data sources.
The bucket algorithm returns the maximally-contained rewriting of the query using the views.
Bucket Algorithm
1 In the first step, the algorithm creates a bucket for each subgoal in the query Q. The bucket contains the views (data sources) that are relevant to answering that particular subgoal.
2 In the second step, the algorithm produces a maximally-contained rewriting of the query using the views, not an equivalent rewriting.

An Example for Bucket algorithm[2]
Mediated Schema:
Enrolled(student, dept)
Registered(student, course, year)
Course(course, number)
Views of Data Sources:
V1(student, number, year) :- Registered(student, course, year), Course(course, number), number ≥ 500, year ≥ 1992.
V2(student, dept, course) :- Registered(student, course, year), Enrolled(student, dept).
V3(student, course) :- Registered(student, course, year), year ≤ 1990.
V4(student, course, number) :- Registered(student, course, year), Course(course, number), Enrolled(student, dept), number ≤ 100.

An Example for Bucket algorithm[2]
S → student, D → dept, Y → year, C → course
Query is:
q(S, D) :- Enrolled(S, D), Registered(S, C, Y), Course(C, N), N ≥ 300, Y ≥ 1995.
Bucket Formed:

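A simplified sketch of the first step (bucket creation) for the views V1 to V4 and the query q is given below. It places a view in a subgoal's bucket when the view has a subgoal over the same relation and, after unification, its comparison predicates do not contradict those of the query. Fuller formulations of the algorithm also require query head variables to map to view head variables; that check is omitted here, so the output is not guaranteed to match the bucket table of [2]. The data structures and function names are hypothetical.

```python
# Atoms are (relation, argument tuple); comparisons are (variable, op, constant).
QUERY_BODY = [
    ("Enrolled", ("S", "D")),
    ("Registered", ("S", "C", "Y")),
    ("Course", ("C", "N")),
]
QUERY_CMPS = [("N", ">=", 300), ("Y", ">=", 1995)]

VIEWS = {
    "V1": {"body": [("Registered", ("student", "course", "year")),
                    ("Course", ("course", "number"))],
           "cmps": [("number", ">=", 500), ("year", ">=", 1992)]},
    "V2": {"body": [("Registered", ("student", "course", "year")),
                    ("Enrolled", ("student", "dept"))],
           "cmps": []},
    "V3": {"body": [("Registered", ("student", "course", "year"))],
           "cmps": [("year", "<=", 1990)]},
    "V4": {"body": [("Registered", ("student", "course", "year")),
                    ("Course", ("course", "number")),
                    ("Enrolled", ("student", "dept"))],
           "cmps": [("number", "<=", 100)]},
}


def consistent(cmps):
    """True if no variable has a lower bound above its upper bound."""
    lower, upper = {}, {}
    for var, op, const in cmps:
        if op == ">=":
            lower[var] = max(lower.get(var, const), const)
        else:
            upper[var] = min(upper.get(var, const), const)
    return all(lower[var] <= upper[var] for var in lower if var in upper)


def build_buckets(query_body, query_cmps, views):
    """Step 1 of the bucket algorithm: one bucket of views per query subgoal."""
    buckets = {}
    for relation, q_args in query_body:
        bucket = []
        for name, view in views.items():
            for v_relation, v_args in view["body"]:
                if v_relation != relation:
                    continue
                # Unify the view subgoal with the query subgoal.
                mapping = dict(zip(v_args, q_args))
                # Unmapped view variables get a view-specific name.
                mapped = [(mapping.get(var, name + "." + var), op, const)
                          for var, op, const in view["cmps"]]
                if consistent(query_cmps + mapped):
                    bucket.append(name)
                    break
        buckets[f"{relation}({', '.join(q_args)})"] = bucket
    return buckets


buckets = build_buckets(QUERY_BODY, QUERY_CMPS, VIEWS)
for subgoal, names in buckets.items():
    print(subgoal, names)
```

V3 is rejected for Registered because year ≤ 1990 contradicts Y ≥ 1995, and V4 is rejected for Course because number ≤ 100 contradicts N ≥ 300. The second step, which combines one view from each bucket and tests containment, is not shown.
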
QR Decomposition[4]
An efficient method for finding the independent attributes that can represent the data in its proper substructure form without losing its semantics.
It decomposes a matrix A into a product A = QR of an orthogonal matrix Q and an upper triangular matrix R.
It is essential to remove the redundant data and bring only the significant data to the forefront.

A Simple Example
The house location attribute may or may not determine the number of living rooms.
This can be established by showing that one vector (house location) is orthogonal to the other vector (living rooms).
QR decomposition is a technique to establish this fact.
QR decomposition with column pivoting is used to distinguish between the independent and dependent attributes.
The next objective is to provide an integrated view of the heterogeneous data sources with the help of a knowledge base.

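The idea of separating independent from dependent attributes can be sketched with a small pure-Python Gram-Schmidt orthogonalization with column pivoting, which is the computation underlying a pivoted QR factorization. This is a stand-in for a library routine, not the method of [4]; the attribute names, the values, and the tolerance are assumptions made for illustration.

```python
import math


def pivoted_gram_schmidt(columns, tol=1e-9):
    """Split column indices into (independent, dependent).

    At each step the column with the largest remaining norm is chosen as
    the pivot, normalized, and projected out of the other columns. Columns
    whose remainder vanishes are linear combinations of the chosen pivots.
    """
    residual = [[float(x) for x in col] for col in columns]
    remaining = list(range(len(columns)))
    independent = []
    while remaining:
        norms = {j: math.sqrt(sum(x * x for x in residual[j])) for j in remaining}
        pivot = max(remaining, key=norms.get)
        if norms[pivot] <= tol:
            break  # everything left is numerically dependent
        q = [x / norms[pivot] for x in residual[pivot]]
        independent.append(pivot)
        remaining.remove(pivot)
        for j in remaining:
            coeff = sum(a * b for a, b in zip(q, residual[j]))
            residual[j] = [a - coeff * b for a, b in zip(residual[j], q)]
    return independent, remaining


# Hypothetical house attributes; area_sqft is just area_sqm in other units.
names = ["rooms", "area_sqm", "area_sqft"]
rooms = [2, 3, 3, 5]
area_sqm = [50.0, 80.0, 65.0, 120.0]
area_sqft = [10.764 * x for x in area_sqm]

independent, dependent = pivoted_gram_schmidt([rooms, area_sqm, area_sqft])
print("independent:", [names[j] for j in independent])
print("dependent:", [names[j] for j in dependent])
```

Because pivoting picks the largest column first, area_sqft is kept and area_sqm is reported as the redundant attribute; either one could be dropped without losing information.
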
Frequency based Coverage Statistics Mining[3]
StatMiner, a statistics mining module, estimates the coverage and overlap statistics.
Ranking of Sources
Ranks all sources in descending order of P(S|Q).
P(S|Q) is the coverage of source S with respect to a given query Q.
Queries are grouped into query classes using the classificatory attributes and their corresponding values.
The query list keeps the frequency of each class.

Frequency based Coverage Statistics Mining[3]
Probability of a query Q being posed to the mediator:
$P(Q) = FR_Q / FR$, where
$FR_Q$ is the access frequency of query Q,
$FR$ is the total frequency of all queries in the query list (Qlist).
Probability that a random query posed to the mediator is subsumed by the class C:
$P(C) = \sum_{Q \text{ is mapped to } C} P(Q)$
Probability that a random query belonging to the class C is present in a set of sources $S'$:
$P(S' \mid C) = \frac{\sum_{Q \in C} P(S' \mid Q) \, P(Q)}{P(C)}$
This is the overlap statistic w.r.t. a query class C.
If the query overlaps multiple sources, then the class-source association rule $C \to S'$ gives the rank of the sources.

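The three probabilities above can be computed directly from a query list. The sketch below uses made-up queries, classes, frequencies, and per-query coverage values, and treats each source set as a single source; none of these numbers come from [3].

```python
from collections import defaultdict

# Hypothetical query list:
# (query, query class, access frequency FR_Q, coverage P(S|Q) per source).
QUERY_LIST = [
    ("q1", "C1", 6, {"S1": 0.9, "S2": 0.2}),
    ("q2", "C1", 2, {"S1": 0.5, "S2": 0.6}),
    ("q3", "C2", 2, {"S1": 0.1, "S2": 0.8}),
]

# FR: total frequency of all queries in the list.
total_frequency = sum(freq for _, _, freq, _ in QUERY_LIST)

# P(Q) = FR_Q / FR
p_query = {q: freq / total_frequency for q, _, freq, _ in QUERY_LIST}

# P(C) = sum of P(Q) over the queries mapped to class C
p_class = defaultdict(float)
for q, cls, _, _ in QUERY_LIST:
    p_class[cls] += p_query[q]

# P(S|C) = (sum over Q in C of P(S|Q) * P(Q)) / P(C)
p_source_given_class = defaultdict(dict)
for q, cls, _, coverage in QUERY_LIST:
    for source, p_s_given_q in coverage.items():
        so_far = p_source_given_class[cls].get(source, 0.0)
        p_source_given_class[cls][source] = (
            so_far + p_s_given_q * p_query[q] / p_class[cls]
        )

# Rank the sources of each class in descending order of P(S|C).
ranking = {
    cls: sorted(stats, key=stats.get, reverse=True)
    for cls, stats in p_source_given_class.items()
}
print(dict(p_class))
print(ranking)
```

Frequent queries dominate the class statistics: q1 is asked three times as often as q2, so S1 ranks first for class C1 even though S2 covers q2 better.
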
Problem Definition
Our objective is to form an approximate view of all the data sources at a global level, in order to reduce the storage requirement at the global level and to retrieve data efficiently.

A first Approximation
Same data model
Adoption of a global schema
The global schema will provide a reconciled, integrated, virtual view of the data sources

References
[1] Xin Dong, Alon Y. Halevy, and Cong Yu. Data integration with uncertainty. In Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB '07, pages 687-698. VLDB Endowment, 2007.
[2] Alon Y. Levy. Logic-based techniques in data integration. In Logic-Based Artificial Intelligence, pages 575-595. Kluwer Academic Publishers, Norwell, MA, USA, 2000.
[3] Zaiqing Nie and Subbarao Kambhampati. A frequency-based approach for mining coverage statistics in data integration. In Proceedings of the 20th International Conference on Data Engineering, ICDE '04, page 387, Washington, DC, USA, 2004. IEEE Computer Society.
[4] Harikumar Sandhya and Mekha Meriam Roy. Data Integration of Heterogeneous Data Sources Using QR Decomposition. Springer International Publishing Switzerland, 2016.
[5] AnHai Doan, Alon Halevy, and Zachary Ives. Principles of Data Integration.
