Integration Approaches
Scalable Data Engineering
Version: Friday, October 6, 2023
Why data integration?
Data Growth
▪ The Global DataSphere reached ~100 ZB in 2023 and is estimated to reach 175 ZB in 2025 [1]
Distributed Information
▪ Multiple heterogeneous systems and applications
▪ Technical and organizational reasons
Necessity of Integration
▪ The “application infrastructure and middleware” market is projected to grow by 10.4% per year,
from USD 48.9 billion in 2021 to USD 88.4 billion in 2027 [2]
▪ Integration is one of the biggest and most costly challenges for IT companies
(40% of budgets are spent on integration software and projects)
[1] David Reinsel, John Gantz, John Rydning: Data Age 2025: The Digitization of the World From Edge to Core, 2018
[2] Global Application Infrastructure Middleware Industry Research Report, Competitive Landscape, Market Size, Regional Status and Prospect, 2022
Distributed Information Systems
Reasons for Distribution

▪ Database technology was created as a means of integration and should have solved the problem
→ single company-wide database
▪ Problems: a single database needs a forward-looking design, and requirements cannot be fulfilled completely
(restructuring, advancement, pragmatism)
→ department and specialty databases
State of Distributed Information Systems

▪ Company data is stored in several databases
▪ The distribution is not disjoint, and representations are mostly heterogeneous
▪ Complex workflows: a change of data in application A → changes of data in applications B, C, …
→ Consistency must be ensured
Terminology
Homogeneity
▪ Homogeneity (Greek homo/homoios = uniform/similar) describes the similarity/uniformity of things
▪ In computer science: uniformity of characteristics, technologies, systems, or concepts
Heterogeneity
▪ Heterogeneity (also inhomogeneity) describes the difference of things
Integration
▪ Integration (Latin integer = whole) describes the creation of a whole
▪ In computer science: the global consolidation of certain local integration objects (systems, applications,
data sets, or functions) using a defined type of integration
Types of Heterogeneity
Semantic Heterogeneity
▪ Attribute Heterogeneities: Synonyms/Homonyms, Mappings, Union Types, Language Expressions
▪ Missing Data: Null Values, Virtual Columns, Semantic Incompatibility
Structural Heterogeneity
▪ Same attribute in different structures
▪ Handling Sets
▪ Attribute name does not define semantics
▪ Attribute composition
Joachim Hammer, Mike Stonebraker, Oguzhan Topsakal: THALIA: Test Harness for the Assessment of Legacy Information
Integration Approaches, technical report, 2004
Attribute-Heterogeneity
Synonyms/Homonyms
▪ Synonyms: different attribute names with the same semantics (earnings → profit)
▪ Homonyms: the same attribute name with different semantics
Mappings
▪ Related attributes of different schemata differ in the transformation/derivation of their data
▪ For example: temperature in °C, K, and °F
▪ Time: time zones and daylight-saving time
Union Types
▪ Attributes of different schemata use different datatypes to represent the same information
(1024 → "1024")
Language
▪ Names or values of identical attributes are expressed in different languages
▪ Attributes: Stadt → City; Name happens to be identical in German and English
▪ Values: München → Munich, Sachsen → Saxony, while Dresden and Berlin stay unchanged (names are especially hard)
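The attribute heterogeneities above could be resolved roughly as follows. This is a minimal sketch, not a complete solution: the dictionaries `SYNONYMS` and `VALUE_TRANSLATIONS` and the functions `to_kelvin` and `normalize` are hypothetical names introduced for illustration, and the target schema (English names, kelvin, integers) is an assumption.

```python
# Hypothetical mapping from source attribute names to the global schema.
SYNONYMS = {"earnings": "profit", "Stadt": "city"}
# Hypothetical value translations for the language case.
VALUE_TRANSLATIONS = {"München": "Munich", "Sachsen": "Saxony"}


def to_kelvin(value: float, unit: str) -> float:
    """Mapping: convert a temperature given in K, C, or F to kelvin."""
    if unit == "K":
        return value
    if unit == "C":
        return value + 273.15
    if unit == "F":
        return (value - 32.0) * 5.0 / 9.0 + 273.15
    raise ValueError(f"unknown unit: {unit}")


def normalize(record: dict) -> dict:
    """Rename attributes, unify datatypes, and translate known values."""
    result = {}
    for name, value in record.items():
        name = SYNONYMS.get(name, name)  # synonyms and language of names
        if isinstance(value, str) and value.isdecimal():
            value = int(value)  # union types: "1024" becomes 1024
        if isinstance(value, str):
            value = VALUE_TRANSLATIONS.get(value, value)  # language of values
        result[name] = value
    return result


normalized = normalize({"earnings": "1024", "Stadt": "München"})
```

Values without an entry in `VALUE_TRANSLATIONS` (such as Dresden) pass through unchanged, which matches the observation that some names need no translation.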
Missing Data
Null-Values (Missing Values)

▪ The value of an attribute does not exist. This is especially problematic when the attribute does not allow
null values
▪ Possible reasons for missing values and ways to find replacements are covered in a later lecture
Virtual Columns
▪ Information is explicitly given in one schema but is derived in another schema
- Schema A stores year of birth and age → schema B stores year of birth and derives age = current year - year of birth
▪ Especially important when working with different representations of the data, e.g., views
- Extraction of base data may not contain the desired information
Semantic Incompatibility
▪ A real-world concept is modelled in one schema, but not in the other.
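The virtual-column case (age derived from year of birth) can be sketched as below. The function names `derive_age` and `materialize_age` and the attribute key `year_of_birth` are assumptions made for this example; propagating a missing base value as `None` is one possible policy, not the only one.

```python
from datetime import date
from typing import Optional


def derive_age(year_of_birth: Optional[int],
               current_year: Optional[int] = None) -> Optional[int]:
    """Virtual column: age = current year - year of birth."""
    if year_of_birth is None:
        return None  # a missing base value propagates to the derived value
    if current_year is None:
        current_year = date.today().year
    return current_year - year_of_birth


def materialize_age(record: dict, current_year: Optional[int] = None) -> dict:
    """Return a copy of the record with an explicit 'age' attribute."""
    result = dict(record)
    result["age"] = derive_age(record.get("year_of_birth"), current_year)
    return result


person = materialize_age({"year_of_birth": 1990}, current_year=2023)
```

Passing `current_year` explicitly keeps the derivation reproducible; without it the result depends on the day the integration runs.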
Structural Heterogeneity
Same attribute in different structures

▪ The same/related attribute exists in different schemata at different positions (with different structure)
▪ A: room in building → B: room in department in building
Handling of sets
▪ A set of values can be expressed as one multi-value attribute in one schema and as a set of single-value
attributes in another schema (tutors → tutor1, tutor2, …)
Attribute name without semantics
▪ The attribute name does not describe its semantics adequately (Attribute11)
Composition of attributes
▪ The same information can be organized by one attribute or by a set of attributes in the form of a hierarchy
("Lecture SDE Mon." → Lecture[title:SDE; day:Mon])
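Two of the structural cases above, handling of sets and composition of attributes, could be converted as sketched here. The helper names `split_set`, `merge_set`, and `decompose_lecture` are hypothetical, and `decompose_lecture` assumes the fixed three-token layout of the slide's example string.

```python
import re


def split_set(record: dict, attribute: str, prefix: str) -> dict:
    """Multi-value attribute to numbered single-value attributes."""
    result = {k: v for k, v in record.items() if k != attribute}
    for index, value in enumerate(record.get(attribute, []), start=1):
        result[f"{prefix}{index}"] = value
    return result


def merge_set(record: dict, prefix: str, attribute: str) -> dict:
    """Numbered single-value attributes to one multi-value attribute."""
    pattern = re.compile(rf"^{re.escape(prefix)}(\d+)$")
    numbered, result = [], {}
    for key, value in record.items():
        match = pattern.match(key)
        if match:
            numbered.append((int(match.group(1)), value))
        else:
            result[key] = value
    numbered.sort(key=lambda pair: pair[0])  # keep tutor1 before tutor2
    result[attribute] = [value for _, value in numbered]
    return result


def decompose_lecture(text: str) -> dict:
    """Split 'Lecture <title> <day>.' into a small attribute hierarchy."""
    _, title, day = text.split()
    return {"Lecture": {"title": title, "day": day.rstrip(".")}}


flat = split_set({"course": "SDE", "tutors": ["Ann", "Bob"]}, "tutors", "tutor")
nested = decompose_lecture("Lecture SDE Mon.")
```

`merge_set` is the inverse of `split_set`, so a record can be moved between both representations without losing the order of the values.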
Goals of Integration
Data consolidation / Homogenization

▪ For the purpose of reporting / analysis / prediction / archiving …
▪ Goal: global, consistent view on all systems
▪ Merge of data sets
▪ Data cleansing, error correction
Interoperability
▪ Goal: data synchronization
▪ Interaction between independent systems and applications
Performance / Scalability / Availability

▪ Goal: Ensure quality of service
▪ HA (High Availability), Disaster Recovery
▪ Load Balancing (query performance vs. synchronization) and virtualization
Classification of Integration Approaches
Classification by Domain
GUI Integration
▪ Uniform visualisation of / access to heterogeneous data sets
▪ E.g.: portals, mashups
Process Integration
▪ Process components of homogeneous services
▪ E.g.: BPEL engines, WSMS
Application Integration
▪ Integration of heterogeneous systems and applications
▪ E.g.: EAI servers, MOM, ETL tools
Information Integration
▪ Queries on global schemata
▪ E.g.: VDBMS/FDBMS, ETL tools, DSMS, PubSub, replication
[Figure: stack of integration layers (GUI, Process, Application, Information Integration); additional labels: Function Integration, Data Integration, SDE]
Classification by Time/Consistency
Synchronous (Strong Consistency)
▪ Strong consistency between replicas
▪ Distributed transactions using 2PC (or 1PC, 3PC)
▪ Little local autonomy, data-driven integration
Asynchronous (Weak Consistency)
▪ Weak consistency allows asynchronous replication
▪ Batch updates for efficient processing
▪ High performance and high local autonomy
▪ Latency between updates of the primary copy and the replicas
▪ Event model: ad hoc / data / time
Eventual Consistency (special case of weak consistency)
▪ Guarantees that the system converges to a consistent state if no new updates occur
▪ Variants: causal consistency, read-your-writes, session consistency, monotonic reads/writes
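The asynchronous case can be illustrated with a toy primary-copy replication. This is a sketch under simplifying assumptions (in-memory dictionaries, a single primary, no failures); the classes `Primary` and `Replica` and the function `synchronize` are hypothetical names.

```python
class Primary:
    """Primary copy: accepts writes and logs them for later shipping."""

    def __init__(self):
        self.data = {}
        self.log = []  # updates not yet propagated to the replicas

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


class Replica:
    """Replica: only changes when a batch of updates is applied."""

    def __init__(self):
        self.data = {}

    def apply_batch(self, batch):
        for key, value in batch:
            self.data[key] = value


def synchronize(primary, replicas):
    """Ship all pending updates as one batch; return the batch size."""
    batch, primary.log = primary.log, []
    for replica in replicas:
        replica.apply_batch(batch)
    return len(batch)


primary, replica = Primary(), Replica()
primary.write("x", 1)
primary.write("x", 2)
stale_read = replica.data.get("x")       # replica lags behind the primary
shipped = synchronize(primary, [replica])
fresh_read = replica.data.get("x")       # consistent once updates stop
```

Between `write` and `synchronize` the replica serves stale data, which is the latency named above; once no new updates arrive and the batch is applied, both copies agree.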
Event Model
Data/Ad-Hoc (synchronous or asynchronous)

▪ Integration is triggered by updates of source systems or ad-hoc queries (distributed databases)
▪ Event – Condition – Action (ECA) principle
- Event: Trigger, e.g. incoming query/message
- Condition: evaluation of events, e.g. Priority=high
- Action: operation or processing if condition is true
Example: message filter
- E: 'receive msg', C: 'type=A', A: 'delete'
- Input: msg1 (B), msg2 (A), msg3 (B) → output after the filter: msg1 (B), msg3 (B)
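The message filter example can be written as a small ECA rule. This is a minimal sketch: the classes `Message` and `Rule` and the function `apply_filter` are hypothetical, and only the 'delete' action from the example is supported.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Message:
    name: str
    msg_type: str


@dataclass
class Rule:
    event: str                            # E: what triggers the rule
    condition: Callable[[Message], bool]  # C: evaluated per event
    action: str                           # A: here only "delete"


def apply_filter(messages: List[Message], rule: Rule,
                 event: str = "receive msg") -> List[Message]:
    """Drop every message for which the rule fires with action 'delete'."""
    kept = []
    for msg in messages:
        fires = event == rule.event and rule.condition(msg)
        if fires and rule.action == "delete":
            continue  # action: delete the message
        kept.append(msg)
    return kept


rule = Rule(event="receive msg",
            condition=lambda msg: msg.msg_type == "A",
            action="delete")
inbox = [Message("msg1", "B"), Message("msg2", "A"), Message("msg3", "B")]
outbox = apply_filter(inbox, rule)
```

With the three messages from the example, only msg2 (type A) satisfies the condition and is removed.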
Time (asynchronous)
▪ Integration is triggered periodically
▪ Asynchronous processing with complete or incremental changes
▪ Typical for materialized integration, daily updated data, e.g. ETL
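For time-triggered integration, the difference between complete and incremental changes can be sketched with a watermark. The example assumes that every source row carries a modification timestamp in a hypothetical `modified` field (plain integers here); real sources may need change logs or triggers instead.

```python
def extract_complete(rows):
    """Complete extraction: every periodic run transfers all rows."""
    return list(rows)


def extract_incremental(rows, last_sync):
    """Incremental extraction: only rows modified after the last run.

    Returns the changed rows and the new watermark for the next run.
    """
    changed = [row for row in rows if row["modified"] > last_sync]
    watermark = max((row["modified"] for row in changed), default=last_sync)
    return changed, watermark


source = [
    {"id": 1, "modified": 10},
    {"id": 2, "modified": 25},
    {"id": 3, "modified": 40},
]
first_batch, watermark = extract_incremental(source, last_sync=20)
second_batch, watermark_after = extract_incremental(source, last_sync=watermark)
```

The second run returns nothing because no row changed after the stored watermark, whereas `extract_complete` would transfer all three rows again.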
Classification by Query Processing
Materialized vs. virtual integration
Systems for querying heterogeneous data (sources)
Materialized Systems ("move the data")
▪ Universal DBMS: native structured data; store all → discard source
▪ Data Warehouse: native and derived structured data; replicate, transform; keep source → update
Virtual Integrated Systems ("leave the data where it is", query sources)
▪ (Meta)search Engines: unstructured native data
▪ Federated Databases: mostly structured native data; illusion of a single DBMS
▪ Mediated Query Systems: unstructured, semistructured, or structured native data
Ruxandra Domenig, Klaus R. Dittrich: An Overview and Classification of Mediated Query Systems, SIGMOD Record, Vol. 28, No. 3, 1999
Virtual vs. Materialized
VDBMS
Virtual (logical) integration
▪ Data is not available physically (only logically)
▪ Global queries must be translated into local queries for the source systems
Properties
▪ Data locality: virtual
▪ Time model: synchronous
▪ Event model: ad hoc / data
▪ Topology: hierarchy / hub
Pros
▪ Full flexibility
▪ Up-to-date query results
▪ No cost for synchronization
Cons
▪ 2PC necessary for updates
▪ Query performance
▪ No history/archiving
[Figure: APP 1 and APP 2 send a query Q to the VDBMS/mediator, which sends the queries Q′, Q′′, Q′′′ to the sources S1, S2, S3 (pull integration)]
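The translation of a global query into local queries might look like the following mediator sketch. It assumes in-memory sources, a global schema with the attributes `name` and `city`, and simple equality queries; the classes `Source` and `Mediator` are hypothetical and do not describe a particular VDBMS.

```python
class Source:
    """A source system with its own local schema."""

    def __init__(self, rows, mapping):
        self.rows = rows
        self.mapping = mapping  # global attribute -> local attribute

    def query(self, global_attr, value):
        # Translate the global query Q into the local query Q'.
        local_attr = self.mapping[global_attr]
        hits = [row for row in self.rows if row.get(local_attr) == value]
        # Translate the local result back into the global schema.
        inverse = {local: glob for glob, local in self.mapping.items()}
        return [{inverse[k]: v for k, v in row.items() if k in inverse}
                for row in hits]


class Mediator:
    """Answers global queries without storing any data itself."""

    def __init__(self, sources):
        self.sources = sources

    def query(self, global_attr, value):
        results = []
        for source in self.sources:  # pull: sources are asked at query time
            results.extend(source.query(global_attr, value))
        return results


s1 = Source([{"name": "Ann", "city": "Dresden"}],
            {"name": "name", "city": "city"})
s2 = Source([{"Name": "Bob", "Stadt": "Dresden"},
             {"Name": "Eve", "Stadt": "Berlin"}],
            {"name": "Name", "city": "Stadt"})
mediator = Mediator([s1, s2])
answer = mediator.query("city", "Dresden")
```

Every call to `mediator.query` reads the sources again, so results are always up to date, but the query cost depends on all sources being reachable and fast.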
Distributed/Homogeneous databases: top-down
▪ Top-down design approach
- One conceptual DB schema
- (Mostly) homogeneous system landscape (same vendor DBs, ...)
▪ DB access as in the centralized case
▪ Limited / no autonomy of the involved DBs
▪ Motivation: availability, performance, scalability
Federated/Heterogeneous databases: bottom-up
▪ Consolidation of related information that is distributed over several (heterogeneous) DBs
- Designed and operated independently
▪ Characteristics
- Heterogeneous data models and transaction management possible
- Problems with heterogeneity/partial export of schema information
▪ Motivation: data consolidation and interoperability
[Figure: VDBMS/mediator on top of the sources S1, S2, S3]
ETL/DWH
Materialized (physical) integration
▪ Data is available physically
▪ Local queries on consolidated data
Properties
▪ Data locality: materialized
▪ Time model: asynchronous
▪ Event model: data / time
▪ Topology: hierarchy / hub
Pros
▪ Query performance (read-optimized)
▪ Independence from source systems
▪ History, archiving, consolidation
Cons
▪ Synchronization (up-to-date data)
▪ Flexibility/autonomy
▪ Data is kept redundantly
[Figure: APP 1 and APP 2 execute queries on the DWH; ETL processes load data from the sources S1, S2, S3 into the DWH (push integration)]
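A materialized integration can be sketched as a tiny ETL run into an in-memory SQLite database that stands in for the DWH. The table `customer`, the global attributes `name` and `city`, and the functions `transform`, `load`, and `run_etl` are assumptions made for this example; the complete reload on every run is the simplest strategy, not a recommendation.

```python
import sqlite3


def transform(rows, mapping):
    """Rename local attributes to the global schema (global -> local map)."""
    return [{glob: row[local] for glob, local in mapping.items()}
            for row in rows]


def load(conn, rows):
    """Insert already transformed rows into the warehouse table."""
    conn.executemany(
        "INSERT INTO customer (name, city) VALUES (?, ?)",
        [(row["name"], row["city"]) for row in rows],
    )


def run_etl(conn, sources):
    """One ETL run: complete reload of the consolidated table."""
    conn.execute("CREATE TABLE IF NOT EXISTS customer (name TEXT, city TEXT)")
    conn.execute("DELETE FROM customer")
    for rows, mapping in sources:  # extract from every source
        load(conn, transform(rows, mapping))
    conn.commit()


sources = [
    ([{"name": "Ann", "city": "Dresden"}],
     {"name": "name", "city": "city"}),
    ([{"Name": "Bob", "Stadt": "Dresden"}, {"Name": "Eve", "Stadt": "Berlin"}],
     {"name": "Name", "city": "Stadt"}),
]
dwh = sqlite3.connect(":memory:")
run_etl(dwh, sources)
in_dresden = dwh.execute(
    "SELECT name FROM customer WHERE city = ? ORDER BY name", ("Dresden",)
).fetchall()
```

After the run, queries touch only the local `customer` table, so they are independent of the sources but see changes only after the next ETL run.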
Summary
Distributed Information
▪ There is a lot of distributed information
▪ Integration is necessary to get coherent data for subsequent tasks such as reporting or analysis
Heterogeneity and Integration

▪ Information from different sources often exhibits different types of heterogeneity
▪ Integration aims at a consistent representation of the information and has to overcome
heterogeneity
▪ Integration approaches can be classified regarding various properties