
DATA INTEGRATION

Prepared by: Saidatul Rahah Hamidi


Outline
 Introduction: data integration
 Examples of data integration applications
 Schema heterogeneity
 Goal of data integration, why it’s a hard problem
 Data integration architectures

Data Integration
 Databases are great: they let us manage huge amounts of
data
 Assuming you’ve put it all into your schema.
 In reality, data sets are often created independently by different groups
 Only later do these groups discover that they need to combine their data!
 At that point, they’re using different systems, different
schemata and have limited interfaces to their data.
 The goal of data integration: tie together different sources,
controlled by many people, under a common schema.

https://www.youtube.com/watch?v=Fh17h6c3sNw
Introduction
 Many databases and sources of data that need to be
integrated to work together
 Almost all applications have many sources of data
 Data integration is the process of combining data from
multiple sources, typically to provide a single view over all
these sources
 And answering queries using the combined information
 Integration can be physical or virtual
 Physical: Copying the data to a warehouse
 Virtual: Keep the data only at the sources

Data Integration
 Data integration is also valid within a single organization
 Integrating data from different departments or sectors

Goals of Data Integration

 Provide
 Uniform (same query interface to all sources)
 Access to (queries; eventually updates too)
 Multiple (we want many, but 2 is hard too)
 Autonomous (DBA doesn’t report to you)
 Heterogeneous (data models are different)
 Distributed (over LAN, WAN, Internet)
 Data Sources (not only databases).

Heterogeneity Problems
 The main problem is the heterogeneity among the data
sources.
 Source Type Heterogeneity : Systems storing the data can
be different

Heterogeneity Problems (cont.)
 Communication Heterogeneity
 Some systems have a web interface; others do not
 Some systems allow direct access through a query language; others offer only APIs
 Schema Heterogeneity
 The structure of the tables storing the data can be different (even
if storing the same data)

Heterogeneity Problems (cont.)
 Data Type Heterogeneity
 Storing the same data (and values) but with different data types
 E.g., Storing the phone number as String or as Number
 E.g., Storing the name as fixed length or variable length
 Value Heterogeneity
 Same logical values stored in different ways
 E.g., ‘Prof’, ‘Prof.’, ‘Professor’
 E.g., ‘Right’, ‘R’, ‘1’ ……… ‘Left’, ‘L’, ‘-1’
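The data type and value mismatches above are often handled by mapping every source value onto one canonical form before comparison. A minimal sketch, assuming hypothetical lookup tables (`TITLE_VARIANTS`, `SIDE_VARIANTS`) and function names that are illustrative only:

```python
import re

# Hypothetical lookup tables; a real project would curate these per source.
TITLE_VARIANTS = {"prof": "Professor", "prof.": "Professor", "professor": "Professor"}
SIDE_VARIANTS = {"right": "R", "r": "R", "1": "R", "left": "L", "l": "L", "-1": "L"}


def normalize_title(raw):
    """Map spelling variants of a title onto one canonical value."""
    key = str(raw).strip().lower()
    # Unknown titles pass through unchanged (only whitespace is trimmed).
    return TITLE_VARIANTS.get(key, str(raw).strip())


def normalize_side(raw):
    """Map 'Right'/'R'/1 and 'Left'/'L'/-1 onto 'R' and 'L'."""
    key = str(raw).strip().lower()
    if key not in SIDE_VARIANTS:
        raise ValueError(f"unknown side value: {raw!r}")
    return SIDE_VARIANTS[key]


def normalize_phone(raw):
    """Return the phone number as a digit-only string, whether the source
    stored it as a string or as a number."""
    return re.sub(r"\D", "", str(raw))
```

Storing the canonical phone number as a string is a deliberate choice here: a numeric column would silently drop leading zeros.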

Heterogeneity Problems (cont.)
 Semantic Heterogeneity
 Same values in different sources can mean different things
 E.g., Column ‘Title’ in one database means ‘Job Title’ while
in another database it means ‘Person Title’

Data integration has to deal with all such issues and more

Reasons for Heterogeneity

Top 10 Data Integration Issues

https://tdwi.org/articles/2006/05/09/data-integration-using-etl-eai-and-eii-tools-to-create-an-integrated-enterprise-report-excerpt.aspx

Motivation
 WWW
 Website construction
 Comparison shopping
 Portals integrating data from multiple sources
 B2B, electronic marketplaces

 Science and culture


 Medical genetics: integrating genomic data
 Astrophysics: monitoring events in the sky.
 Culture: uniform access to all cultural databases produced by
countries in Europe.

https://www.youtube.com/watch?v=MaNjsbdSDZ4
Data Integration:
A Higher-level Abstraction

[Figure: a query is posed against the Mediated Schema, which is linked by Semantic Mappings to sources S1, S2, S3]

Independence of:
• source & location
• data model, syntax
• semantic variations
• …

Example source data:
<cd> <title> The best of … </title>
…
<artist> Carreras </artist>
<artist> Pavarotti </artist>
…
<artist> Domingo </artist>
<price> 19.95 </price>
</cd>

Applications of Data Integration
 Business
 Science
 Government
 The Web
 Pretty much everywhere

Application Area 1: Business

[Figure: EII apps (CRM, ERP, Portals) query a Single Mediated View over Enterprise Databases, Legacy Databases, and Services and Applications]

50% of all IT $$$ spent here!

Application Area 2: Science
[Figure: concepts (Phenotype, Gene, Sequenceable Entity, Structured Vocabulary, Experiment, Nucleotide Sequence, Protein, Microarray Experiment) linked to sources (OMIM, Swiss-Prot, HUGO, GO, Gene-Clinics, Entrez, Locus-Link, GEO)]

Hundreds of biomedical data sources available; growing rapidly!


Application Area 3: The Web

The Deep Web
 Millions of high quality HTML forms out there
 Each form has its own special interface
 Hard to explore data across sites.
 Goal (for some domains):
 A single interface into a multitude of deep-web sources

 Deep Web: a collective term for websites or parts of
websites that aren't indexed by search engines.

https://www.deepweb-sites.com/

http://idke.ruc.edu.cn/projects/web.htm

Other Reasons to Integrate Data
 Create a (useful) web site for tracking services
 Collaborate with third parties
 E.g., create branded services
 Comply with government regulations
 Find “risky” employees
 Business intelligence
 What’s really wrong with our products?

Goal of Data Integration
 Uniform query access to a set of data sources
 Handle:
 Scale of sources: from tens to millions
 Heterogeneity
 Autonomy
 Semi-structure

Why is it Hard?
 Systems-level reasons:
 Managing different platforms
 SQL across multiple systems is not so simple
 Distributed query processing
 Logical reasons:
 Schema (and data) heterogeneity
 ‘Social’ reasons:
 Locating and capturing relevant data in the enterprise.
 Convincing people to share (data fiefdoms)
 Security, privacy and performance implications.

Setting Expectations
Data integration is AI-Complete.
 Completely automated solutions unlikely.

Goal 1:
 Reduce the effort needed to set up an integration application.

Goal 2:
 Enable the system to perform gracefully with uncertainty (e.g.,
on the web)

Data Integration Smorgasbord
Something for everyone:
 Theory of modeling data sources
 Systems aspects of data integration
 Architectural issues: e.g., P2P data sharing
 AI @ work: automated schema matching
 Web: latest on data integration & web
 Commercial products: BEA, IBM
 Semantic Web: what does it have to offer?
 New trends in DBMS: uncertainty, dataspaces

Types of Data Integration
 Data Consolidation
 Data consolidation physically brings data together from several separate
systems, creating a version of the consolidated data in one data store.
Often the goal of data consolidation is to reduce the number of data
storage locations. Extract, transform, and load (ETL) technology
supports data consolidation.
 Data Propagation
 Data propagation is the use of applications to copy data from one
location to another. It is event-driven and can be done synchronously or
asynchronously. Most synchronous data propagation supports a two-way
data exchange between the source and the target. Enterprise application
integration (EAI) and enterprise data replication (EDR) technologies
support data propagation.
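
As an illustration of data consolidation with ETL, the following minimal sketch extracts rows from two sources, transforms them to one target schema, and loads them into a single store. The source layouts (`crm_rows`, `billing_rows`), the `customer` table, and all sample values are hypothetical, and an in-memory SQLite database stands in for the consolidated data store.

```python
import sqlite3

# Extract: hypothetical rows as they might arrive from two separate systems.
crm_rows = [{"id": 1, "name": "Ana Lim", "phone": "555-0101"}]
billing_rows = [{"cust_no": "2", "full_name": "Raj Singh", "tel": 5550102}]


def transform_crm(row):
    """Reshape a CRM row into the target (id, name, phone) layout."""
    return (int(row["id"]), row["name"], str(row["phone"]).replace("-", ""))


def transform_billing(row):
    """Reshape a billing row; note the different field names and types."""
    return (int(row["cust_no"]), row["full_name"], str(row["tel"]).replace("-", ""))


def consolidate(conn):
    """Load the transformed rows from both sources into one table."""
    conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, phone TEXT)")
    rows = [transform_crm(r) for r in crm_rows]
    rows += [transform_billing(r) for r in billing_rows]
    conn.executemany("INSERT INTO customer VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


warehouse = consolidate(sqlite3.connect(":memory:"))
result = warehouse.execute("SELECT id, name, phone FROM customer ORDER BY id").fetchall()
```

After loading, queries run against the consolidated store only and the sources are no longer contacted.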

Types of Data Integration (cont.)
 Data Virtualization
 Virtualization uses an interface to provide a near real-time, unified view of
data from disparate sources with different data models. Data can be viewed
in one location, but is not stored in that single location. Data virtualization
retrieves and interprets data, but does not require uniform formatting or a
single point of access.
 https://www.youtube.com/watch?v=6Ws-3dOGasE

 Data Federation
 Federation is technically a form of data virtualization. It uses a virtual database and creates a
common data model for heterogeneous data from different systems. Data is brought together and
viewable from a single point of access. Enterprise information integration (EII) is a technology
that supports data federation. It uses data abstraction to provide a unified view of data from
different sources. That data can then be presented or analyzed in new ways through applications.
 Virtualization and federation are good workarounds for situations where data consolidation is
cost prohibitive or would cause too many security and compliance issues.
https://www.coursera.org/learn/data-analytics-business/lecture/SzzGY/3-virtualization-federation-and-in-memory-computing
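
A federated query can be sketched with SQLite, whose `ATTACH DATABASE` statement lets one connection query several databases through a single point of access. This is only an analogy for a federation layer: the databases `sales` and `support`, their tables, and the sample rows are all hypothetical, and real sources would be separate systems rather than in-memory databases.

```python
import sqlite3

# One connection acts as the single point of access.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS sales")
conn.execute("ATTACH DATABASE ':memory:' AS support")

# Two sources holding similar data under different schemas.
conn.execute("CREATE TABLE sales.customers (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE support.clients (client_id INTEGER, full_name TEXT)")
conn.execute("INSERT INTO sales.customers VALUES (1, 'Ana Lim')")
conn.execute("INSERT INTO support.clients VALUES (2, 'Raj Singh')")

# The unified view is computed at query time; no data is copied between sources.
UNIFIED_QUERY = """
    SELECT id, name, 'sales' AS source FROM sales.customers
    UNION ALL
    SELECT client_id, full_name, 'support' FROM support.clients
    ORDER BY id
"""
unified = conn.execute(UNIFIED_QUERY).fetchall()
```

The `source` column is added so that each row in the unified result can still be traced back to the system it came from.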
Types of Data Integration (cont.)
 Data Warehousing
 Warehousing is included in this list because it is a commonly used term.
However, its meaning is more generic than the other methods previously
mentioned. Data warehouses are storage repositories for data. However,
when the term “data warehousing” is used, it implies the cleansing,
reformatting, and storage of data, which is essentially data integration.

Source: https://www.globalscape.com/blog/5-types-data-integration

Models of Data Integration
 Federated Databases

 Data Warehousing

 Mediation

Federated Databases
 Simplest architecture
 Every pair of sources can build their own mapping and
transformation
 Source X needs to communicate with source Y → build a
mapping between X and Y
 Does not have to be between all sources (on demand)
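
The pairwise, on-demand mappings described above can be sketched as a registry keyed by (source, destination). All names here (`MAPPINGS`, `register_mapping`, `translate`, the field names) are hypothetical. The helper `max_pairwise_mappings` shows why this architecture scales poorly if every source must talk to every other one.

```python
# Hypothetical registry of pairwise mappings, filled in only on demand.
MAPPINGS = {}


def register_mapping(src, dst, field_map):
    """Record how fields of source `src` are renamed for source `dst`."""
    MAPPINGS[(src, dst)] = field_map


def translate(record, src, dst):
    """Rewrite a record from src's field names into dst's field names."""
    field_map = MAPPINGS.get((src, dst))
    if field_map is None:
        raise LookupError(f"no mapping registered from {src} to {dst}")
    # Fields without a counterpart in the destination are dropped.
    return {field_map[k]: v for k, v in record.items() if k in field_map}


def max_pairwise_mappings(n_sources):
    """Upper bound on directed mappings if every source talks to every other."""
    return n_sources * (n_sources - 1)


register_mapping("X", "Y", {"cust_name": "name", "tel": "phone"})
y_record = translate({"cust_name": "Ana Lim", "tel": "5550101"}, "X", "Y")
```

With 10 sources the upper bound is already 90 directed mappings, which is why building them only on demand matters.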

Data Warehousing
 Very common approach
 Data from multiple sources are copied and stored in a
warehouse
 Data is materialized in the warehouse
 Users can then query the warehouse database only

Data Warehousing: Synchronization
 How to synchronize the data between the sources and the
warehouse?
 Two approaches:
 Complete rebuild
 Periodically re-build the warehouse from the sources
 (e.g., every night or every week)
 (+) The procedure is easy
 (-) Expensive and time consuming
 Incremental update
 Periodically update the warehouse based on the changes in the sources
 (+) Less expensive and more efficient
 (-) More complex to perform incremental update
 (-) Requires sources to keep track of their updates
In both approaches, the warehouse is not up-to-date at all times
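
The two synchronization approaches can be contrasted in a small sketch. It assumes the source keeps a change log of `(seq, op, row)` tuples ordered by sequence number; that log format and the function names are hypothetical, and the warehouse is modelled as a plain dictionary keyed by row id.

```python
def complete_rebuild(source_rows):
    """Discard the old warehouse contents and copy everything again."""
    return {row["id"]: dict(row) for row in source_rows}


def incremental_update(warehouse, change_log, last_synced):
    """Apply only the changes recorded after the last synchronization.

    change_log entries are hypothetical (seq, op, row) tuples kept by the
    source, assumed to be ordered by seq. Returns the new checkpoint.
    """
    for seq, op, row in change_log:
        if seq <= last_synced:
            continue  # already applied in an earlier run
        if op == "delete":
            warehouse.pop(row["id"], None)
        else:  # "insert" or "update"
            warehouse[row["id"]] = dict(row)
        last_synced = seq
    return last_synced


source = [{"id": 1, "name": "Ana"}, {"id": 2, "name": "Raj"}]
warehouse = complete_rebuild(source)

log = [(1, "update", {"id": 2, "name": "Rajesh"}), (2, "delete", {"id": 1})]
checkpoint = incremental_update(warehouse, log, last_synced=0)
```

The rebuild needs nothing from the source except its current rows, while the incremental path depends on the source maintaining the log. That is the trade-off listed above.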

Data Warehousing

Traditional DW Architecture

Mediation
 Mediator is a virtual view over the data (it does not store any data)
 Data is stored only at the sources
 Mediator has a virtual schema that combines all schemas from the sources
 Usually wrappers are the components that perform the mapping of queries
 The mapping takes place at query time
 This is unlike warehousing where mapping takes place at
upload time

Mediation: Example
 Mediator Schema

 Source 1 Schema

 Source 2 Schema

 What if we need the first name, last name, and cell phone of
customer ID = 100?
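
A query like the one above could be answered by a mediator roughly as follows. The schemas on this slide are not reproduced in the text, so everything here is a hypothetical stand-in: source 1 is assumed to hold names, source 2 is assumed to hold phone numbers, and the mediated schema exposes `first_name`, `last_name`, and `cell_phone`.

```python
# Hypothetical source contents; not the actual schemas from the slide.
SOURCE_1 = [{"cust_id": 100, "fname": "Ana", "lname": "Lim"}]
SOURCE_2 = [
    {"cid": 100, "kind": "cell", "number": "555-0101"},
    {"cid": 100, "kind": "home", "number": "555-0199"},
]


def wrapper_1(customer_id):
    """Translate a mediated-schema lookup into a lookup on source 1."""
    for row in SOURCE_1:
        if row["cust_id"] == customer_id:
            return {"first_name": row["fname"], "last_name": row["lname"]}
    return {}


def wrapper_2(customer_id):
    """Translate a mediated-schema lookup into a lookup on source 2."""
    for row in SOURCE_2:
        if row["cid"] == customer_id and row["kind"] == "cell":
            return {"cell_phone": row["number"]}
    return {}


def mediator_query(customer_id):
    """Combine wrapper results at query time; the mediator stores nothing."""
    answer = {"id": customer_id}
    answer.update(wrapper_1(customer_id))
    answer.update(wrapper_2(customer_id))
    return answer


result = mediator_query(100)
```

Each wrapper hides its source's field names from the mediator, so adding a third source would mean writing one more wrapper rather than changing the query.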

Mediation: Example

Mediation: Example

Virtual, Warehousing and in Between
 Data warehousing: integrate by bringing the data into a
single physical warehouse
 Virtual data integration: leave the data at the sources and
access it at query time.

 Some differences, but semantic heterogeneity arises in
both cases.
 Numerous intermediate architectures.
 The course illustrates data integration technology mostly
through the virtual architecture.

Virtual Data Integration Architecture
[Figure: layered architecture, top to bottom]
Mediated Schema or Warehouse
Query reformulation / Query over materialized data
Source descriptions / Transforms
Wrapper / Extractor (one per source)
Sources: RDBMS1, RDBMS2, HTML1, XML1
Entity Resolution
 Data coming from different sources may be different even
if representing the same objects
 Entity resolution is the process of:
 Figuring out which records represent the same thing
 Linking relevant records together
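
One simple way to decide whether two records represent the same thing is to compare string similarity on a name field and require agreement on a second field. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; the record layout, the matching rule, and the `threshold` value are illustrative assumptions, not a recommended configuration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio in [0, 1] between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def same_entity(rec_a, rec_b, threshold=0.85):
    """Assumed rule: phones must match exactly and names must be similar."""
    if rec_a["phone"] != rec_b["phone"]:
        return False
    return similarity(rec_a["name"], rec_b["name"]) >= threshold


def link_records(source_a, source_b):
    """Return (id_a, id_b) pairs judged to refer to the same entity."""
    return [
        (a["id"], b["id"])
        for a in source_a
        for b in source_b
        if same_entity(a, b)
    ]


source_a = [{"id": "A1", "name": "Jonathan Smith", "phone": "5550101"}]
source_b = [
    {"id": "B7", "name": "Jonathon Smith", "phone": "5550101"},
    {"id": "B8", "name": "Maria Gomez", "phone": "5550102"},
]
links = link_records(source_a, source_b)
```

Comparing every pair is quadratic in the number of records, so this only illustrates the matching decision, not how to scale it.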

Merging Similar Records
 How to merge similar records?
 In some cases, e.g., misspellings or synonyms, it is possible to
merge the records
 In other cases, e.g., conflicts, there is no easy way to find the
correct values
 Report all the results we have
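
The merging policy above can be sketched as follows: synonyms are collapsed to one value, and where the sources genuinely conflict, all values are reported. The `synonyms` table, the field names, and the sample values are hypothetical.

```python
def merge_records(records, synonyms=None):
    """Merge records for one entity, keeping every distinct value on conflict."""
    synonyms = synonyms or {}
    merged = {}
    for record in records:
        for field, value in record.items():
            canonical = synonyms.get(value, value)  # collapse known synonyms
            values = merged.setdefault(field, [])
            if canonical not in values:
                values.append(canonical)
    # Fields where all sources agree become a single value;
    # conflicting fields stay as a list of every reported value.
    return {f: v[0] if len(v) == 1 else v for f, v in merged.items()}


merged = merge_records(
    [
        {"title": "Prof.", "city": "Kuala Lumpur"},
        {"title": "Professor", "city": "Penang"},
    ],
    synonyms={"Prof.": "Professor", "Prof": "Professor"},
)
```

Here the two titles merge into one value, while the two cities are both kept because nothing indicates which one is correct.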

Automated Integration
 Data integration requires a lot of manual effort
 Data warehouse → designing and implementing the ETL module
 Mediators → designing and implementing the wrappers
 Federated database → designing and implementing the mapping modules
(wrappers)

Recent Research

 Consider several database schemas for different bookstores
 How to match their schemas automatically → schema matching
techniques
 How to find matching records → record linkage techniques
 How to find errors, synonyms, etc. and correct them → data cleansing
techniques
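
As a rough illustration of name-based schema matching, the sketch below proposes, for each column of one bookstore schema, the most similar column name in another. The two schemas, the `threshold`, and the function names are hypothetical; published schema matchers typically also use data types, instance values, and structure, which this sketch ignores.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Similarity of two column names, ignoring case, underscores, and spaces."""
    def normalize(s):
        return s.lower().replace("_", "").replace(" ", "")
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_schemas(schema_a, schema_b, threshold=0.6):
    """For each column in schema_a, propose the most similar column in schema_b."""
    matches = {}
    for col_a in schema_a:
        best = max(schema_b, key=lambda col_b: name_similarity(col_a, col_b))
        if name_similarity(col_a, best) >= threshold:
            matches[col_a] = best  # a proposal for a human to confirm
    return matches


# Hypothetical column lists for two bookstore databases.
store_1 = ["book_title", "author_name", "price"]
store_2 = ["Title", "Author", "ListPrice", "ISBN"]
proposed = match_schemas(store_1, store_2)
```

The output is a set of candidate correspondences only. Columns such as `ISBN` with no counterpart are left unmatched, and a person would still need to check the proposals.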

Summary
 Data integration: abstract away the fact that data comes
from multiple sources in varying schemata.
 Problem occurs everywhere: it’s key to business, science,
Web and government.
 Goal: reduce the effort involved in integrating.
 Regardless of the architecture, heterogeneity is a key
issue.
 Architectures range from warehousing to virtual
integration.

