Data Integration: A Higher-level Abstraction
Query
Mediated Schema
Independence of:
• source & location
• data model, syntax
• semantic variations
• …
Semantic Mappings
S1 S2 S3

SSN          Name     Category
123-45-6789  Charles  undergrad
234-56-7890  Dan      grad
…            …        …

SSN          CID
123-45-6789  CSE444
123-45-6789  CSE444
234-56-7890  CSE142

CID     Name               Quarter
CSE444  Databases          fall
CSE541  Operating systems  winter

<cd> <title> The best of … </title>
     <artist> Carreras </artist>
     <artist> Pavarotti </artist>
     <artist> Domingo </artist>
     <price> 19.95 </price>
</cd>

Legacy Databases
Services and Applications

Outline
Introduction: data integration as a new abstraction
Examples of data integration applications
Schema heterogeneity
Goal of data integration, why it’s a hard problem
Data integration architectures
Application Area 2: Science
Entities: Sequenceable Entity, Structured Vocabulary, Phenotype, Gene, Experiment, Protein, Nucleotide Sequence, Microarray Experiment
Sources: Swiss-Prot, OMIM, HUGO, GO, GeneClinics, LocusLink, Entrez, GEO

Application Area 3: The Web
Create a single site to search for jobs/rentals/…

Outline
Introduction: data integration as a new abstraction
Examples of data integration applications
Schema heterogeneity
Goal of data integration, why it’s a hard problem
Data integration architectures

Enterprise Data Integration: FullServe Corporation
Employees: FullTimeEmp, Hire, TempEmployees
Resumes: Interview, CV
Training: Courses, Enrollments
Services: Services, Customers, Contracts
Sales: Products, Sales
HelpLine: Calls

EuroCard Corporation
Employees: Employees, Hire
Resumes: Interview
Credit Cards: Customer, CustDetail
HelpLine: Calls

FullServe
FullTimeEmp(ssn, empId, firstName, middleName, lastName)
Hire(empId, hireDate, recruiter)
TempEmployees(ssn, hireStart, hireEnd)

EuroCard
Employees(ID, firstNameMiddleInitial, lastName)
Hire(ID, hireDate, recruiter)

Find all employees (making over $100K)

Agents should have a full view of customer when they call in.
Sales: Products, Sales
Credit Cards: Customer, CustDetail
Services: Services, Customers, Contracts
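
To make the employee example concrete, here is a minimal Python sketch of how the FullServe relations (FullTimeEmp, Hire) and the EuroCard relations (Employees, Hire) might be mapped into one unified relation. The `Employee` class, the function names, and the rule for splitting `firstNameMiddleInitial` are assumptions made for illustration and do not appear on the slides. The schemas shown carry no salary attribute, so the "over $100K" filter is left out.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class Employee:
    """Hypothetical mediated-schema relation (an assumption, not on the slides)."""
    company: str
    emp_id: str
    first_name: str
    middle: Optional[str]
    last_name: str
    hire_date: Optional[str]


def from_fullserve(full_time_emp: Iterable[dict], hire: Iterable[dict]) -> Iterator[Employee]:
    # FullServe joins FullTimeEmp and Hire on empId.
    hire_by_id = {h["empId"]: h for h in hire}
    for e in full_time_emp:
        h = hire_by_id.get(e["empId"])
        yield Employee("FullServe", str(e["empId"]), e["firstName"],
                       e["middleName"] or None, e["lastName"],
                       h["hireDate"] if h else None)


def from_eurocard(employees: Iterable[dict], hire: Iterable[dict]) -> Iterator[Employee]:
    # EuroCard joins Employees and Hire on ID and packs the first name and
    # middle initial into one attribute; splitting on the first space is an
    # assumed convention.
    hire_by_id = {h["ID"]: h for h in hire}
    for e in employees:
        first, _, middle = e["firstNameMiddleInitial"].partition(" ")
        h = hire_by_id.get(e["ID"])
        yield Employee("EuroCard", str(e["ID"]), first, middle or None,
                       e["lastName"], h["hireDate"] if h else None)


def all_employees(fs_emp, fs_hire, ec_emp, ec_hire) -> list:
    # "Find all employees" becomes a union over both mapped sources.
    return list(from_fullserve(fs_emp, fs_hire)) + list(from_eurocard(ec_emp, ec_hire))
```

Even this small case shows schema heterogeneity: the two companies use different key names (empId vs. ID) and different name granularity, and each difference needs an explicit mapping rule.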

Other Reasons to Integrate Data
Create a (useful) web site for tracking services
Collaborate with third parties
  E.g., create branded services
Comply with government regulations
  Find “risky” employees
Business intelligence
  What’s really wrong with our products?

Outline
Introduction: data integration as a new abstraction
Examples of data integration applications
Schema heterogeneity
Goal of data integration, why it’s a hard problem
Data integration architectures

Setting Expectations
Data integration is AI-Complete.
Completely automated solutions unlikely.
Goal 1:
  Reduce the effort needed to set up an integration application.
Goal 2:
  Enable the system to perform gracefully with uncertainty (e.g., on the web)

Data Integration Smorgasbord
Something for everyone:
Theory of modeling data sources
Systems aspects of data integration
Architectural issues: e.g., P2P data sharing
AI @ work: automated schema matching
Web: latest on data integration & web
Commercial products: BEA, IBM
Semantic Web: what does it have to offer?
New trends in DBMS: uncertainty, dataspaces

Virtual Data Integration Architecture
Mediated Schema or Warehouse
Query reformulation / Query over materialized data
Source descriptions / Transforms

Example
Movie(title, director, year, genre)
Actors(title, actor)
Plays(movie, location, startTime)
Reviews(title, rating, description)
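
As a rough illustration, a query such as "Woody Allen comedies playing in New York" could be posed against these four relations as follows. This sketch materializes the relations in an in-memory SQLite database only so the query can run. A virtual integration system would instead reformulate the query onto the underlying sources. The film titles, years, and start times are placeholder values, and the exact strings used for genre and location are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Movie(title TEXT, director TEXT, year INTEGER, genre TEXT);
CREATE TABLE Actors(title TEXT, actor TEXT);
CREATE TABLE Plays(movie TEXT, location TEXT, startTime TEXT);
CREATE TABLE Reviews(title TEXT, rating REAL, description TEXT);
""")

# Placeholder rows, invented purely so the query returns something.
conn.executemany("INSERT INTO Movie VALUES (?, ?, ?, ?)", [
    ("Film A", "Woody Allen", 2000, "comedy"),
    ("Film B", "Woody Allen", 2001, "drama"),
    ("Film C", "Director X", 2002, "comedy"),
])
conn.executemany("INSERT INTO Plays VALUES (?, ?, ?)", [
    ("Film A", "New York", "19:00"),
    ("Film B", "New York", "20:00"),
    ("Film C", "New York", "21:00"),
    ("Film A", "Seattle", "18:00"),
])

# The user writes the query against the mediated schema only;
# no source names or source-specific attributes appear in it.
QUERY = """
SELECT m.title, p.startTime
FROM Movie AS m
JOIN Plays AS p ON m.title = p.movie
WHERE m.director = 'Woody Allen'
  AND m.genre = 'comedy'
  AND p.location = 'New York'
"""
rows = conn.execute(QUERY).fetchall()
print(rows)
```

The join condition pairs Movie.title with Plays.movie, which is a small reminder that even inside one mediated schema the same concept can carry different attribute names.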

Woody Allen Comedies in NY
Mediated schema
Movie: title, director, year, genre
Actors: title, actor
Plays: movie, location, startTime
Reviews: title, rating, description

Chapter 8
Query optimizer
Physical query plan
Replanning request
Execution engine
Five sources, each behind its own wrapper

Data Warehouse
Define procedural mappings in an “ETL tool” to import the data and clean it.
Periodically copy all of the data from the data sources.
Note that the sources and the warehouse are basically independent at this point.
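
The warehouse refresh described here (import, clean, periodically copy everything) can be sketched in a few lines of Python. This is only an illustrative sketch and not an actual ETL tool: `clean`, `refresh_warehouse`, the choice of the Reviews relation as the target, and the idea that each source is a callable returning rows are all assumptions made for the example.

```python
import sqlite3
from typing import Callable, Iterable, Sequence


def clean(row: Sequence) -> tuple:
    """Assumed cleaning rule: trim strings and turn empty strings into NULL."""
    return tuple((v.strip() or None) if isinstance(v, str) else v for v in row)


def refresh_warehouse(sources: Iterable[Callable[[], Iterable[Sequence]]],
                      warehouse: sqlite3.Connection) -> None:
    """Full refresh: discard the old copy, then re-import every source."""
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS Reviews(title TEXT, rating REAL, description TEXT)"
    )
    warehouse.execute("DELETE FROM Reviews")
    for extract in sources:
        # Each source is asked for all of its rows; nothing is incremental.
        warehouse.executemany(
            "INSERT INTO Reviews VALUES (?, ?, ?)",
            (clean(r) for r in extract()),
        )
    warehouse.commit()
```

Between two calls to `refresh_warehouse`, queries touch only the warehouse copy, which is why the sources and the warehouse are independent and also why the warehouse data can be stale.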

Pros and Cons of Data Warehouses
Need to spend time to design the physical database layout, as well as logical
  This actually takes a lot of effort!
Data is generally not up-to-date (lazy or offline refresh)
Queries over the warehouse don’t disrupt the data sources
Can run very heavy-duty computations, including data mining and cleaning

Summary of Chapter 1
Data integration: abstract away the fact that data comes from multiple sources in varying schemata.
Problem occurs everywhere: it’s key to business, science, Web and government.
Goal: reduce the effort involved in integrating.
Regardless of the architecture, heterogeneity is a key issue.
Architectures range from warehousing to virtual integration.