Version: Friday, October 6, 2023
Why data integration?
Data Growth
▪ The Global DataSphere reached ~100 ZB in 2023 and is estimated to reach 175 ZB in 2025 [1]
Distributed Information
▪ Multiple heterogeneous systems and applications
▪ Technical and organizational reasons
Necessity of Integration
▪ “Application infrastructure and middleware market” grows by 10.4% per year,
from 48.9 billion USD in 2021 to 88.4 billion USD in 2027 [2]
▪ Integration is one of the biggest and most costly challenges for IT companies
(40% of budgets are spent on integration software and projects)
[1] - David Reinsel, John Gantz, John Rydning – Data Age 2025: The Digitization of the World From Edge to Core, 2018
[2] - Global Application Infrastructure Middleware Industry Research Report, Competitive Landscape, Market Size, Regional Status and Prospect, 2022
Distributed Information Systems
→ Assure consistency
Terminology
Homogeneity
▪ Homogeneity (Greek homo/homoios = uniform/similar) describes the similarity/uniformity of things
▪ In computer science: uniformity of characteristics, technologies, systems, or concepts
Heterogeneity
▪ Heterogeneity (also inhomogeneity) describes the difference of things
Integration
▪ Integration (Latin integer = whole) describes the creation of a whole.
▪ In computer science: the global consolidation of certain local integration objects (systems, applications,
data sets, or functions) using a defined type of integration
Types of Heterogeneity
Heterogeneity
▪ Semantic Heterogeneity
▪ Structural Heterogeneity
Joachim Hammer, Mike Stonebraker, Oguzhan Topsakal: THALIA: Test Harness for the Assessment of Legacy Information
Integration Approaches, technical report, 2004
Attribute-Heterogeneity
Synonyms/Homonyms
▪ Synonyms: different attribute names with the same semantics (earnings → profit)
▪ Homonyms: the same attribute name with different semantics
Mappings
▪ Related attributes of different schemata differ in transformation/derivation of data
▪ For example: temperature in °C and K and °F
▪ Time: Time zones and daylight-saving time
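The mappings above can be sketched as small conversion functions. This is a minimal illustration, assuming Kelvin and UTC as the global representation; the function names and the explicit UTC-offset parameter are hypothetical. A real pipeline would more likely use a time zone database so that daylight-saving rules are applied automatically.

```python
from datetime import datetime, timedelta, timezone


def to_kelvin(value: float, unit: str) -> float:
    """Map a temperature from a source unit to Kelvin (assumed global unit)."""
    if unit == "K":
        return value
    if unit == "C":
        return value + 273.15
    if unit == "F":
        return (value - 32.0) * 5.0 / 9.0 + 273.15
    raise ValueError(f"unknown temperature unit: {unit}")


def to_utc(local: datetime, utc_offset_hours: float) -> datetime:
    """Attach the source's UTC offset to a naive timestamp and convert it to UTC.

    The offset already has to include daylight-saving time,
    e.g. +2 for Central European Summer Time, +1 for winter time.
    """
    source_tz = timezone(timedelta(hours=utc_offset_hours))
    return local.replace(tzinfo=source_tz).astimezone(timezone.utc)
```

Usage: to_kelvin(32.0, "F") and to_kelvin(0.0, "C") both describe the freezing point of water, and to_utc(datetime(2023, 10, 6, 12, 0), 2) maps local noon in summer time to 10:00 UTC.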
Union Types
▪ Attributes of different schemata use different datatypes to represent the same information
(1024 → “1024”)
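A union-type conflict such as 1024 vs. “1024” is typically resolved by coercing both representations to one target type. A minimal sketch, assuming integer is the target type; normalize_int is a hypothetical helper name.

```python
from typing import Union


def normalize_int(value: Union[int, str]) -> int:
    """Coerce an integer that arrives either as int or as a numeric string."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python; reject it explicitly.
        raise TypeError("booleans are not accepted as integers here")
    if isinstance(value, int):
        return value
    if isinstance(value, str):
        return int(value.strip())  # raises ValueError for non-numeric text
    raise TypeError(f"unsupported type: {type(value).__name__}")
```

After normalization, values from both schemata compare equal: normalize_int(1024) == normalize_int("1024").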
Language
▪ Names or values of identical attributes are expressed in different languages
▪ Attributes: Stadt → City; Name happens to be identical in German and English
▪ Values: München → Munich, Sachsen → Saxony; Dresden and Berlin stay unchanged (proper names are especially hard)
Missing Data
Virtual Columns
▪ Information is explicitly given in one schema but is derived in another schema
- schema A: year of birth, age → schema B: year of birth, age = current year – year of birth
▪ Especially important when working with different representations of the data, e.g., views
- Extraction of base data may not contain the desired information
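The virtual column from the example can be sketched as a derivation step applied during integration. This is only an illustration using the slide's formula; the record layout (a dict with the keys year_of_birth and age) and the function name with_age are assumptions.

```python
from datetime import date
from typing import Optional


def with_age(record: dict, today: Optional[date] = None) -> dict:
    """Return a copy of the record that always carries an 'age' attribute.

    If the source stores age explicitly, it is kept; otherwise it is
    derived as current year minus year of birth.
    """
    today = today or date.today()
    result = dict(record)
    if "age" not in result:
        result["age"] = today.year - result["year_of_birth"]
    return result
```

Usage: with_age({"year_of_birth": 1990}, date(2023, 10, 6)) yields a record with age 33, while a record that already stores age is returned unchanged.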
Semantic Incompatibility
▪ A real-world concept is modelled in one schema, but not in the other.
Structural Heterogeneity
Handling of sets
▪ A set of values can be expressed as one multi-value attribute in one schema and as a set of single-value
attributes in another schema. (tutors → tutor1, tutor2,…)
Composition of attributes
▪ The same information can be organized by one attribute or a set of attributes in the form of a hierarchy.
(“Lecture SDE Mon.” → Lecture[title:SDE; day:Mon])
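Both structural conflicts can be resolved by restructuring records. A minimal sketch, assuming records are dicts, numbered attributes follow the pattern tutor1, tutor2, …, and the composed string has the layout “Lecture <title> <day>.”; both function names are hypothetical.

```python
import re


def collect_tutors(record: dict) -> list:
    """Fold single-value attributes tutor1, tutor2, ... into one multi-value list."""
    pairs = []
    for key, value in record.items():
        match = re.fullmatch(r"tutor(\d+)", key)
        if match and value is not None:
            pairs.append((int(match.group(1)), value))
    # Sort by the numeric suffix so the original order is preserved.
    return [value for _, value in sorted(pairs)]


def split_lecture(text: str) -> dict:
    """Decompose a composed attribute into its parts (title and day)."""
    match = re.fullmatch(r"Lecture\s+(\S+)\s+(\w+)\.?", text.strip())
    if match is None:
        raise ValueError(f"unexpected lecture format: {text!r}")
    return {"title": match.group(1), "day": match.group(2)}
```

Usage: collect_tutors({"tutor2": "Bob", "tutor1": "Ann"}) gives the tutors in order, and split_lecture("Lecture SDE Mon.") gives a record with title SDE and day Mon.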
Goals of Integration
Interoperability
▪ Goal: data synchronization
▪ Interaction between independent systems and applications
Classification of Integration Approaches
Classification by Domain
Application Integration
▪ Integration of heterogeneous systems and applications
▪ E.g.: EAI Server, MOM, ETL Tools
Information Integration
▪ Queries on global schemata
▪ E.g.: VDBMS/FDBMS, ETL Tools, DSMS, PubSub, Replication
Figure (labels only): Application Integration, Function Integration, Information Integration, Data Integration, SDE
Classification by Time/Consistency
Event Model
Classification by Query Processing
Ruxandra Domenig, Klaus R. Dittrich: An Overview and Classification of Mediated Query Systems, SIGMOD Record, Vol. 28, No. 3, 1999
Virtual vs. Materialized
VDBMS
Virtual (logical) integration
▪ Data is not available physically (only logically)
▪ Global queries must be translated into local queries for the source systems
Properties
▪ Data locality: virtual
▪ Time model: synchronous
▪ Event model: ad hoc / data
▪ Topology: hierarchy / hub
Pros
▪ Full flexibility
▪ Up-to-date query results
▪ No cost for synchronization (pull integration)
Cons
▪ 2PC for updates necessary
▪ Query performance
▪ No history/archiving
Figure: APP 1 and APP 2 send a query Q to the VDBMS/mediator, which sends the local queries Q', Q'', Q''' to the sources S1, S2, S3
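The translation of a global query into local queries can be sketched with a tiny mediator. This is an illustrative sketch, not a real VDBMS: the two in-memory SQLite sources, their schemata, and the global schema (name, profit) are assumptions, reusing the earnings → profit synonym from above.

```python
import sqlite3


def make_source(ddl: str, insert: str, rows: list) -> sqlite3.Connection:
    """Create one hypothetical source system as an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert, rows)
    return conn


# Two sources with heterogeneous local schemata.
s1 = make_source("CREATE TABLE firm (name TEXT, earnings REAL)",
                 "INSERT INTO firm VALUES (?, ?)", [("A", 10.0), ("B", 3.0)])
s2 = make_source("CREATE TABLE company (title TEXT, profit REAL)",
                 "INSERT INTO company VALUES (?, ?)", [("C", 7.5)])

# Mapping from the global schema (name, profit) to each local schema.
SOURCES = [
    (s1, "firm", {"name": "name", "profit": "earnings"}),
    (s2, "company", {"name": "title", "profit": "profit"}),
]


def global_query(min_profit: float) -> list:
    """Answer 'name, profit WHERE profit >= ?' by querying all sources at query time."""
    result = []
    for conn, table, cols in SOURCES:
        # Translate the global query Q into a local query for this source.
        local_sql = (f"SELECT {cols['name']}, {cols['profit']} "
                     f"FROM {table} WHERE {cols['profit']} >= ?")
        result.extend(conn.execute(local_sql, (min_profit,)).fetchall())
    return sorted(result)
```

Because no data is copied, every call sees the current state of the sources, which matches the pull-integration property. The price is that every query pays the cost of contacting all sources.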
Virtual vs. Materialized
ETL/DWH
Materialized (physical) integration
▪ Data is available physically
▪ Local queries on consolidated data
Properties
▪ Data locality: materialized
▪ Time model: asynchronous
▪ Event model: data/time
Figure: source systems S1, S2, S3
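Materialized integration can be sketched as a small extract-transform-load step into a consolidated store. This is an illustrative sketch only: the two sources (plain Python lists), the table company_profit, and all function names are assumptions.

```python
import sqlite3

# Two hypothetical sources with heterogeneous attribute names.
source_a = [{"name": "A", "earnings": 10.0}, {"name": "B", "earnings": 3.0}]
source_b = [{"title": "C", "profit": 7.5}]


def extract_transform() -> list:
    """Extract all source records and map them to the consolidated schema."""
    rows = [(r["name"], r["earnings"], "S1") for r in source_a]
    rows += [(r["title"], r["profit"], "S2") for r in source_b]
    return rows


def load(dwh: sqlite3.Connection, rows: list) -> None:
    """Replace the warehouse content with a fresh snapshot (full reload)."""
    dwh.execute("DELETE FROM company_profit")
    dwh.executemany("INSERT INTO company_profit VALUES (?, ?, ?)", rows)
    dwh.commit()


def total_profit(dwh: sqlite3.Connection) -> float:
    """Local query on the consolidated data; no source is contacted."""
    return dwh.execute("SELECT SUM(profit) FROM company_profit").fetchone()[0]


dwh = sqlite3.connect(":memory:")
dwh.execute("CREATE TABLE company_profit (name TEXT, profit REAL, source TEXT)")
load(dwh, extract_transform())
```

Queries run only against the warehouse, so a change in a source becomes visible only after the next load. This illustrates the asynchronous time model.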
Summary
Distributed Information
▪ There is a lot of distributed information
▪ Integration is necessary to get coherent data for subsequent tasks such as reporting or analysis