
CHAPTER 1: INTRODUCTION TO DATA INTEGRATION

Outline
• Introduction: data integration as a new abstraction
• Examples of data integration applications
• Schema heterogeneity
• Goal of data integration, why it’s a hard problem
• Data integration architectures

DBMS: it’s all about abstraction
• Logical vs. Physical; What vs. How.

Students:
SSN          Name     Category
123-45-6789  Charles  undergrad
234-56-7890  Dan      grad
…            …

Takes:
SSN          CID
123-45-6789  CSE444
123-45-6789  CSE444
234-56-7890  CSE142
…

Courses:
CID     Name               Quarter
CSE444  Databases          fall
CSE541  Operating systems  winter

SELECT C.name
FROM Students S, Takes T, Courses C
WHERE S.name='Mary' and
S.ssn = T.ssn and T.cid = C.cid

Data Integration
• Databases are great: they let us manage huge amounts of data
  • Assuming you’ve put it all into your schema.
• In reality, data sets are often created independently
  • Only to discover later that they need to combine their data!
• At that point, they’re using different systems, different schemata and have limited interfaces to their data.
• The goal of data integration: tie together different sources, controlled by many people, under a common schema.
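The Students/Takes/Courses query from the DBMS slide can be tried out with Python's built-in sqlite3 module. This is only a minimal sketch: the function names `build_db` and `courses_taken_by` are made up for illustration, and the tables hold exactly the rows visible on the slide (the rows elided with "…" are not invented).

```python
import sqlite3

def build_db():
    """Create an in-memory database holding the rows shown on the slide."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Students (ssn TEXT, name TEXT, category TEXT);
        CREATE TABLE Takes (ssn TEXT, cid TEXT);
        CREATE TABLE Courses (cid TEXT, name TEXT, quarter TEXT);
    """)
    conn.executemany("INSERT INTO Students VALUES (?, ?, ?)", [
        ("123-45-6789", "Charles", "undergrad"),
        ("234-56-7890", "Dan", "grad"),
    ])
    conn.executemany("INSERT INTO Takes VALUES (?, ?)", [
        ("123-45-6789", "CSE444"),
        ("123-45-6789", "CSE444"),
        ("234-56-7890", "CSE142"),
    ])
    conn.executemany("INSERT INTO Courses VALUES (?, ?, ?)", [
        ("CSE444", "Databases", "fall"),
        ("CSE541", "Operating systems", "winter"),
    ])
    conn.commit()
    return conn

def courses_taken_by(conn, student_name):
    """The slide's query, with the student name passed as a parameter."""
    rows = conn.execute(
        """SELECT C.name
           FROM Students S, Takes T, Courses C
           WHERE S.name = ? AND S.ssn = T.ssn AND T.cid = C.cid""",
        (student_name,),
    ).fetchall()
    return [name for (name,) in rows]

conn = build_db()
print(courses_taken_by(conn, "Charles"))
print(courses_taken_by(conn, "Mary"))  # empty here: Mary's rows are elided on the slide
```

The query says what is wanted (course names for a student) and nothing about how the tables are stored or joined, which is the logical vs. physical separation the slide refers to.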

Data Integration: A Higher-level Abstraction
[Figure: a query is posed over the mediated schema, which is connected by semantic mappings to sources S1, S2, S3. The sources include the Students/Takes/Courses tables and an XML document:]
<cd> <title> The best of … </title>
<artist> Carreras </artist>
<artist> Pavarotti </artist>
<artist> Domingo </artist>
<price> 19.95 </price>
</cd>

Independence of:
• source & location
• data model, syntax
• semantic variations
• …

Applications of Data Integration
• Business
• Science
• Government
• The Web
• Pretty much everywhere

Application Area 1: Business
[Figure: EII apps (CRM, ERP, portals, …) work against a single mediated view over enterprise databases, legacy databases, and services and applications.]

50% of all IT $$$ spent here!

Application Area 2: Science
[Figure: entities Phenotype, Gene, Sequenceable Entity, Structured Vocabulary, Experiment, Nucleotide Sequence, Microarray Experiment and Protein, drawn over the sources OMIM, Swiss-Prot, HUGO, GO, GeneClinics, Entrez, LocusLink and GEO.]

Hundreds of biomedical data sources available; growing rapidly!

Application Area 3: The Web

The Deep Web
• Millions of high-quality HTML forms out there
• Each form has its own special interface
  • Hard to explore data across sites.
• Goal (for some domains):
  • A single interface into a multitude of deep-web sources.

Hundreds of millions of high-quality tables on the Web

Create a single site to search for jobs/rentals/…

Easily traverse between the sites by clicking their names

Enterprise Data Integration: FullServe Corporation and EuroCard Corporation
[Figure: the databases of each company and their tables. The Resumes databases hold the Interview and CV tables.]

FullServe Corporation
  Employees: FullTimeEmp, Hire, TempEmployees
  Training: Courses, Enrollments
  Services: Services, Customers, Contracts
  Sales: Products, Sales
  Resumes
  HelpLine: Calls

EuroCard Corporation
  Employees: Employees, Hire
  Credit Cards: Customer, CustDetail
  Resumes
  HelpLine: Calls

Examples of Heterogeneity
FullServe
  FullTimeEmp: ssn, empId, firstName, middleName, lastName
  Hire: empId, hireDate, recruiter
  TempEmployees: ssn, hireStart, hireEnd
EuroCard
  Employees: ID, firstNameMiddleInitial, lastName
  Hire: ID, hireDate, recruiter

Find all employees (making over $100K)

Customer Call Center
Agents should have a full view of customers when they call in.
[Figure: the call center draws on Sales (Products, Sales), Credit Cards (Customer, CustDetail) and Services (Services, Customers, Contracts).]
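To make the FullTimeEmp vs. Employees mismatch concrete, here is a minimal sketch that maps rows from both companies into one shape. The `Employee` record, the two mapping functions and the sample rows are all hypothetical; in particular, the code assumes that EuroCard's firstNameMiddleInitial field looks like "Anna K." (first name, then an optional initial), which the slides do not specify.

```python
from typing import NamedTuple

class Employee(NamedTuple):
    """Hypothetical common record; not taken from the slides."""
    emp_id: str
    first_name: str
    last_name: str

def from_fullserve(row: dict) -> Employee:
    # FullServe: FullTimeEmp(ssn, empId, firstName, middleName, lastName)
    return Employee(str(row["empId"]), row["firstName"], row["lastName"])

def from_eurocard(row: dict) -> Employee:
    # EuroCard: Employees(ID, firstNameMiddleInitial, lastName)
    # Assumption: the combined field starts with the first name.
    first = row["firstNameMiddleInitial"].split()[0]
    return Employee(str(row["ID"]), first, row["lastName"])

# Hypothetical sample rows, for illustration only.
fullserve_rows = [
    {"ssn": "000-00-0001", "empId": 17, "firstName": "Maria",
     "middleName": "Elena", "lastName": "Lopez"},
]
eurocard_rows = [
    {"ID": "E-204", "firstNameMiddleInitial": "Anna K.", "lastName": "Berg"},
]

all_employees = (
    [from_fullserve(r) for r in fullserve_rows]
    + [from_eurocard(r) for r in eurocard_rows]
)
for emp in all_employees:
    print(emp)
```

Even this small case needs decisions the schemas do not answer by themselves, such as whether empId and ID identify employees in the same way and how to split the combined name field.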

Other Reasons to Integrate Data
• Create a (useful) web site for tracking services
• Collaborate with third parties
  • E.g., create branded services
• Comply with government regulations
  • Find “risky” employees
• Business intelligence
  • What’s really wrong with our products?

Goal of Data Integration
• Uniform query access to a set of data sources
• Handle:
  • Scale of sources: from tens to millions
  • Heterogeneity
  • Autonomy
  • Semi-structure

Why is it Hard?
• Systems-level reasons:
  • Managing different platforms
  • SQL across multiple systems is not so simple
  • Distributed query processing
• Logical reasons:
  • Schema (and data) heterogeneity
• ‘Social’ reasons:
  • Locating and capturing relevant data in the enterprise.
  • Convincing people to share (data fiefdoms)
  • Security, privacy and performance implications.

Setting Expectations
Data integration is AI-Complete.
• Completely automated solutions unlikely.

Goal 1:
• Reduce the effort needed to set up an integration application.

Goal 2:
• Enable the system to perform gracefully with uncertainty (e.g., on the web)

Data Integration Smorgasbord
Something for everyone:
• Theory of modeling data sources
• Systems aspects of data integration
• Architectural issues: e.g., P2P data sharing
• AI @ work: automated schema matching
• Web: latest on data integration & web
• Commercial products: BEA, IBM
• Semantic Web: what does it have to offer?
• New trends in DBMS: uncertainty, dataspaces

Virtual, Warehousing and in Between
• Data warehousing: integrate by bringing the data into a single physical warehouse
• Virtual data integration: leave the data at the sources and access it at query time.
• Some differences, but semantic heterogeneity arises in both cases.
• Numerous intermediate architectures.
• The course illustrates data integration technology mostly through the virtual architecture.

Virtual Data Integration Architecture
[Figure: a query is posed over the mediated schema (or warehouse). Query reformulation (or a query over materialized data) uses the source descriptions/transforms. Each source is reached through a wrapper/extractor. Sources shown: RDBMS 1, RDBMS 2, HTML 1, XML 1.]

Example
Mediated schema:
Movie(title, director, year, genre)
Actors(title, actor)
Plays(movie, location, startTime)
Reviews(title, rating, description)

Sources:
S1: Movies(name, actors, director, genre)
S2: Cinemas(place, movie, start)
S3: CinemasInNYC(cinema, title, startTime)
S4: CinemasInSF(location, movie, startingTime)
S5: Reviews(title, date, grade, review)

Wrappers
<cd> <title> The best of … </title>
<artist> Abiteboul </artist>
<artist> Pavarotti </artist>
<artist> Domingo </artist>
<price> 19.95 </price>
</cd>
…
Send queries to data sources and transform answers into tuples (or other internal data model). (Chapter 9)

Mediation Languages
Describe relationships between mediated schema and data sources (Chapter 3).

Mediated Schema
CD: ASIN, Title, Genre, …
Artist: ASIN, name, …

[Figure: the mediated schema is connected by logic to two sources.]
CDs
  Album: ASIN, Price, DiscountPrice, Studio
  Artists: ASIN, ArtistName, GroupName
  CDCategories: ASIN, Category
Books
  Title: ISBN, Price, DiscountPrice, Edition
  Authors: ISBN, FirstName, LastName
  BookCategories: ISBN, Category
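The wrapper's job, transforming a source's answer into tuples, can be sketched for the <cd> document from the Wrappers slide. This is an illustrative sketch only: the function name `wrap_cd` and the (title, artist, price) tuple layout are assumptions, and a real wrapper would also have to send the query to the source.

```python
import xml.etree.ElementTree as ET

# The XML answer shown on the Wrappers slide.
CD_XML = """
<cd> <title> The best of … </title>
  <artist> Abiteboul </artist>
  <artist> Pavarotti </artist>
  <artist> Domingo </artist>
  <price> 19.95 </price>
</cd>
"""

def wrap_cd(xml_text):
    """Turn one <cd> element into (title, artist, price) tuples, one per artist."""
    cd = ET.fromstring(xml_text.strip())
    title = cd.findtext("title").strip()
    price = float(cd.findtext("price"))
    return [(title, artist.text.strip(), price) for artist in cd.findall("artist")]

for row in wrap_cd(CD_XML):
    print(row)
```

Flattening one row per artist is a design choice made here for simplicity; a wrapper could equally emit a nested record if the system's internal data model supports it.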

Woody Allen Comedies in NY

Mediated schema:
Movie: title, director, year, genre
Actors: title, actor
Plays: movie, location, startTime
Reviews: title, rating, description

select title, startTime
from Movie, Plays
where Movie.title=Plays.movie AND
location='New York' AND
director='Woody Allen'

Sources S1 and S3 are relevant, sources S4 and S5 are irrelevant, and source S2 is relevant but possibly redundant.

S1: Movies: name, actors, director, genre
S2: Cinemas: place, movie, start
S3: Cinemas in NYC: cinema, title, startTime
S4: Cinemas in SF: location, movie, startingTime
S5: Reviews: title, date, grade, review
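A rough sketch of how the Woody Allen query could be answered from the relevant sources S1 and S3 follows. Each mediated relation is written here as a simple view over one source; this is one possible way to express source descriptions, not necessarily the one the course uses. The function names and all sample rows (cinema names, start times) are hypothetical.

```python
# Hypothetical sample rows, for illustration only.
S1_MOVIES = [  # Movies(name, actors, director, genre)
    ("Annie Hall", "Diane Keaton", "Woody Allen", "comedy"),
    ("Alien", "Sigourney Weaver", "Ridley Scott", "sci-fi"),
]
S3_CINEMAS_IN_NYC = [  # CinemasInNYC(cinema, title, startTime)
    ("Cinema One", "Annie Hall", "19:30"),
    ("Cinema Two", "Alien", "21:00"),
]

def movie_view():
    """Movie(title, director, year, genre) seen through S1; S1 has no year, so it is None."""
    return [(name, director, None, genre)
            for name, _actors, director, genre in S1_MOVIES]

def plays_view():
    """Plays(movie, location, startTime) seen through S3; every S3 row is in New York."""
    return [(title, "New York", start)
            for _cinema, title, start in S3_CINEMAS_IN_NYC]

def woody_allen_in_ny():
    """The slide's query: join Movie and Plays, filter on location and director."""
    return [
        (title, start)
        for title, director, _year, _genre in movie_view()
        for movie, location, start in plays_view()
        if title == movie and location == "New York" and director == "Woody Allen"
    ]

print(woody_allen_in_ny())
```

S4 and S5 never appear because they cannot contribute New York show times, and S2 is left out only to keep the sketch short; as the slide notes, it is relevant but possibly redundant with S3.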

Query Processing
[Figure: Query → query reformulation → logical query plan → query optimizer → physical query plan → execution engine → wrappers → sources. The figure also shows results, a replanning request, and a reference to Chapter 8.]

Data Warehouses – Offline Replication
• Determine physical schema
  • Define a database with this schema
• Define procedural mappings in an “ETL tool” to import the data and clean it.
• Periodically copy all of the data from the data sources
  • Note that the sources and the warehouse are basically independent at this point
[Figure: a query is posed directly over the data warehouse.]
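The offline-replication steps for a warehouse can be sketched with sqlite3 standing in for the warehouse. Everything here is an assumption made for illustration: the `refresh_warehouse` function, the choice of the two cinema sources, the sample rows, and the trivial "cleaning" (trimming whitespace and filling in a city name).

```python
import sqlite3

def refresh_warehouse(warehouse, nyc_rows, sf_rows):
    """Rebuild the warehouse's Plays table from scratch (a full periodic copy)."""
    warehouse.execute("DROP TABLE IF EXISTS Plays")
    warehouse.execute(
        "CREATE TABLE Plays (movie TEXT, location TEXT, startTime TEXT)")
    # Procedural mappings: reshape each source's rows and clean the titles.
    cleaned = [(title.strip(), "New York", start)
               for _cinema, title, start in nyc_rows]
    cleaned += [(movie.strip(), "San Francisco", start)
                for _location, movie, start in sf_rows]
    warehouse.executemany("INSERT INTO Plays VALUES (?, ?, ?)", cleaned)
    warehouse.commit()

def count_plays(warehouse):
    return warehouse.execute("SELECT COUNT(*) FROM Plays").fetchone()[0]

# Hypothetical source contents.
nyc = [("Cinema One", " Annie Hall ", "19:30")]   # CinemasInNYC(cinema, title, startTime)
sf = [("Mission St", "Alien", "21:00")]           # CinemasInSF(location, movie, startingTime)

wh = sqlite3.connect(":memory:")
refresh_warehouse(wh, nyc, sf)

nyc.append(("Cinema Two", "Manhattan", "20:00"))  # the source changes...
stale_count = count_plays(wh)                     # ...but the warehouse has not seen it yet
refresh_warehouse(wh, nyc, sf)
fresh_count = count_plays(wh)
print(stale_count, fresh_count)
```

Between refreshes the warehouse and the sources evolve independently, so queries over the warehouse may see stale data but never touch the sources.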

Pros and Cons of Data Warehouses
• Need to spend time to design the physical database layout, as well as logical
  • This actually takes a lot of effort!
• Data is generally not up-to-date (lazy or offline refresh)
• Queries over the warehouse don’t disrupt the data sources
• Can run very heavy-duty computations, including data mining and cleaning

Summary of Chapter 1
• Data integration: abstract away the fact that data comes from multiple sources in varying schemata.
• Problem occurs everywhere: it’s key to business, science, Web and government.
• Goal: reduce the effort involved in integrating.
• Regardless of the architecture, heterogeneity is a key issue.
• Architectures range from warehousing to virtual integration.
