
CM1 – Introduction to Data Integration and Data Exchange

Definitions
Data Integration : Query heterogeneous data in different sources via a virtual global schema
Data Exchange : Transform data structured under a source schema into data structured under a different target schema
Schema Integration : A set of source schemas needs to be integrated into one mediated schema
Schema Evolution : An original schema S1 evolves into subsequent versions S1’, S1’’, etc.
Schema mappings
What is it? High-level, declarative assertions that specify the relationship between two schemas : they specify data interoperability and are manipulated by tools ; ESSENTIAL building blocks in formalizing data integration and data exchange
Language: specified with logic (first-order logic)
Schema Mapping M = (S, T, Σ) : Source S ; Target T ; High-level, declarative assertions Σ that specify the relationship between S and T
Data Exchange via M : Transform a given source instance I into a target instance J, so that <I, J> satisfies the specifications Σ of M.
Embedded dependencies (EDs)
What is it? Kind of constraint on a relational database (most general) ; includes TGDs and EGDs
Tuple-generating dependencies (TGDs) : inclusion and multivalued dependencies (two attributes are independent of each other, but both depend on a third one)
Equality-generating dependencies (EGDs) : functional dependencies
Schema mappings : specification language
Relationship between source and target, s-t TGDs : φ(x) → ∃y ψ(x, y)
Examples of simple mapping tasks : Copy – Projection – Decomposition – Column Augmentation – Join – Combinations of the above
Target dependencies : Target TGDs : φT(x) → ∃y ψT(x, y) ; Target EGDs : φT(x) → x1 = x2
Data Exchange framework
Schema Mapping M = (S, T, Σst, Σt)
Multiple solutions: given a source instance, multiple solutions may exist
Ex: E(x, y) → ∃z (H(x, z) ∧ H(z, y)) with I = {E(a, b)} | Solutions : infinitely many solutions exist
Main issue: multiple target instances satisfy the specification of the schema mapping ; which one is the best?
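A minimal sketch of how one solution for the example above can be produced by chasing the source instance with the s-t TGD. The function name `chase_st_tgd` and the `N1, N2, ...` naming of labeled nulls are assumptions made for illustration, not notation from the course.

```python
from itertools import count

def chase_st_tgd(source_edges):
    """Apply E(x, y) -> exists z (H(x, z) and H(z, y)) once per source tuple.

    Each existential z is witnessed by a fresh labeled null (hypothetical
    naming: N1, N2, ...), which stands for an unknown value.
    """
    fresh = count(1)
    target = set()
    for x, y in sorted(source_edges):
        z = f"N{next(fresh)}"   # fresh labeled null for this tuple
        target.add((x, z))      # H(x, z)
        target.add((z, y))      # H(z, y)
    return target

I = {("a", "b")}                # source instance {E(a, b)}
J = chase_st_tgd(I)             # target facts over H
print(sorted(J))                # [('N1', 'b'), ('a', 'N1')]
```

Replacing the null by any constant, or adding further H facts, also yields a solution, which is why infinitely many solutions exist.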
Universal solutions in data exchange
What is it? A solution that has homomorphisms to all other solutions (the most general solution)
Ex : H(a, X), H(X, b) ; H(a, X), H(X, b), H(a, Y), H(Y, b)
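The homomorphism test behind this definition can be sketched as a small backtracking search. This is an illustrative sketch only: instances are sets of pairs over the single relation H, and the convention that uppercase strings are labeled nulls while lowercase strings are constants is an assumption of the sketch.

```python
def is_null(value):
    # Assumption of this sketch: labeled nulls are uppercase (X, Y),
    # constants are lowercase (a, b, c).
    return value.isupper()

def find_homomorphism(source, target):
    """Return a mapping h on the nulls of `source` such that h(source) is
    contained in `target` (constants map to themselves), or None."""
    facts = sorted(source)

    def unify(g, a, b):
        if is_null(a):
            return g.setdefault(a, b) == b   # bind the null, or check its binding
        return a == b                        # constants must be preserved

    def extend(i, h):
        if i == len(facts):
            return h
        for candidate in target:
            g = dict(h)
            if all(unify(g, a, b) for a, b in zip(facts[i], candidate)):
                result = extend(i + 1, g)
                if result is not None:
                    return result
        return None

    return extend(0, {})

J1 = {("a", "X"), ("X", "b")}
J2 = {("a", "X"), ("X", "b"), ("a", "Y"), ("Y", "b")}
J3 = {("a", "c"), ("c", "b")}        # a solution that uses only constants

print(find_homomorphism(J1, J3))     # {'X': 'c'}
print(find_homomorphism(J3, J1))     # None
```

In this run J1 and J2 map into each other and into J3, while J3 maps into neither, so J3 is a solution but not a universal one.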
Weakly acyclic sets of tgds
Position graph of a set Σ of target TGDs :
Nodes: R.A with R relation symbol, A attribute of R
Edges: for every φ(x) → ∃y ψ(x, y) in Σ, for every variable x of the tuple x occurring in ψ, for every occurrence of x in φ in position R.A:
- For every occurrence of x in ψ in position S.B, add an edge R.A → S.B
- In addition, for every existentially quantified y that occurs in ψ in position T.C, add a special edge R.A => T.C
Σ is weakly acyclic if the position graph has no cycle containing a special edge
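The construction can be sketched in a few lines. The encoding is an assumption of this sketch: a TGD is a pair (body, head) of atom lists, an atom is (relation, tuple of variables), positions are (relation, 0-based index), and any head variable that does not occur in the body is treated as existentially quantified.

```python
def position_graph(tgds):
    """Return (regular_edges, special_edges) of the position graph."""
    regular, special = set(), set()
    for body, head in tgds:
        body_vars = {v for _, args in body for v in args}
        for rel, args in body:
            for i, x in enumerate(args):
                if not any(x in hargs for _, hargs in head):
                    continue                     # x does not occur in the head
                for hrel, hargs in head:
                    for j, v in enumerate(hargs):
                        if v == x:
                            regular.add(((rel, i), (hrel, j)))
                        elif v not in body_vars:  # existential variable
                            special.add(((rel, i), (hrel, j)))
    return regular, special

def is_weakly_acyclic(tgds):
    regular, special = position_graph(tgds)
    edges = regular | special

    def reachable(src, dst):
        seen, stack = set(), [src]
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(t for s, t in edges if s == node)
        return False

    # A special edge u => v lies on a cycle iff u is reachable from v.
    return not any(reachable(v, u) for u, v in special)

# E(x, y) -> exists z (H(x, z) and H(z, y)): no edge leads back into E
copy_like = [([("E", ("x", "y"))], [("H", ("x", "z")), ("H", ("z", "y"))])]
# H(x, y) -> exists z H(y, z): special self-loop on position H.1
looping = [([("H", ("x", "y"))], [("H", ("y", "z"))])]

print(is_weakly_acyclic(copy_like))  # True
print(is_weakly_acyclic(looping))    # False
```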
CM2 – Cleaning data with constraints
The data quality problem
Dirty data : data that is inconsistent, inaccurate, incomplete, stale, or deliberately falsified. Dirty data is costly
How does data get dirty? Errors and inconsistencies introduced during data gathering, storage, transmission, transformation, integration…
Data quality tools : help automatically to
- Discover data quality rules
- Reason about these rules
- Detect errors based on violations of these rules
- Repair (or suggest repairs of) the data
Existing tools : ETL (Extraction – Transformation – Loading)
Data quality criteria : Consistency (different ages for the same patient) – Accuracy (real-life true value) – Completeness (missing value or tuple) – Timeliness (too stale or most recent value)
Functional dependencies (FDs): I ⊨ R([A1, ..., Am] → B)
Error detection : FDs help to detect errors in a single relation
Ex : FD: customer(NI# → name, AC, phn, street, city, zip) ; consider an instance I of customer : I does not satisfy the FD if two tuples have the same NI# but differ on one of the other attributes.
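A minimal sketch of FD-based error detection on a single relation, assuming tuples are represented as Python dicts. The function name `fd_violations` and the sample rows are made up for illustration.

```python
from collections import defaultdict

def fd_violations(rows, lhs, rhs):
    """Group rows by their lhs values; a group with more than one distinct
    rhs value violates the FD lhs -> rhs."""
    groups = defaultdict(set)
    for row in rows:
        key = tuple(row[a] for a in lhs)
        groups[key].add(tuple(row[b] for b in rhs))
    return {key: values for key, values in groups.items() if len(values) > 1}

# Hypothetical customer tuples (only some attributes shown).
customers = [
    {"NI#": "n1", "name": "M. Smith", "city": "EDI"},
    {"NI#": "n1", "name": "M. Smith", "city": "LDN"},   # same NI#, other city
    {"NI#": "n2", "name": "R. Luth", "city": "NYC"},
]

print(fd_violations(customers, ["NI#"], ["name", "city"]))
```

The FD only tells us that the two `n1` tuples conflict, not which of them is wrong.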
Inclusion dependencies (INDs): (I, J) ⊨ R[A1, ..., An] ⊆ S[B1, ..., Bn]
Ex: book[asin, title, price] ⊆ item[asin, title, price]
These instances do not satisfy the IND if one book is not in item.
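Checking an IND across two relations can be sketched the same way. The function name `ind_violations` and the sample tuples are hypothetical.

```python
def ind_violations(r_rows, r_attrs, s_rows, s_attrs):
    """Return the rows of R whose projection on r_attrs has no matching
    projection on s_attrs in S, i.e. the witnesses that R[...] ⊆ S[...] fails."""
    s_projection = {tuple(row[b] for b in s_attrs) for row in s_rows}
    return [row for row in r_rows
            if tuple(row[a] for a in r_attrs) not in s_projection]

attrs = ["asin", "title", "price"]
book = [
    {"asin": "b1", "title": "T1", "price": 10},
    {"asin": "b2", "title": "T2", "price": 20},   # missing from item
]
item = [
    {"asin": "b1", "title": "T1", "price": 10, "type": "book"},
]

print(ind_violations(book, attrs, item, attrs))
```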
Error detection: inclusion dependencies help to detect errors across relations
More reasons for using dependencies:
- They capture a fundamental part of the semantics of data (errors and inconsistencies)
- Techniques and inference systems are in place for reasoning about dependencies (data quality rules): remove redundant rules ; identify dirty rules
- Various algorithms exist to discover dependencies from sample data
- Repair algorithms, based on the so-called chase procedure, are studied in depth
=> Dependencies should become part of data cleaning processes
A conditional dependency theory for data consistency
Conditional functional dependencies (CFDs) : extension of FDs ; to express some semantic properties, we need to add equality with constants
Ex: “In the UK, the zip code uniquely determines the street” : R([CC = 44, zip] → [street])
N.B. It is a conditional FD: it may not hold for other countries – It cannot be expressed as a standard FD, because it needs constants
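A sketch of how the UK zip code rule could be checked: the FD is enforced only on tuples that match the constant pattern. The function name `cfd_violations`, the dict encoding of the pattern and the sample addresses are assumptions for illustration.

```python
def cfd_violations(rows, pattern, lhs, rhs):
    """Check lhs -> rhs only on the rows that match `pattern`
    (a dict attribute -> required constant)."""
    first_seen = {}
    violations = []
    for row in rows:
        if any(row[a] != c for a, c in pattern.items()):
            continue                      # the condition does not apply
        key = tuple(row[a] for a in lhs)
        value = tuple(row[b] for b in rhs)
        if key in first_seen and first_seen[key] != value:
            violations.append((key, first_seen[key], value))
        first_seen.setdefault(key, value)
    return violations

addresses = [
    {"CC": "44", "zip": "EH4 1DT", "street": "Mayfield"},
    {"CC": "44", "zip": "EH4 1DT", "street": "Crichton"},   # conflict in the UK
    {"CC": "01", "zip": "07974", "street": "Mtn Ave"},
    {"CC": "01", "zip": "07974", "street": "Main St"},      # allowed: CC != 44
]

print(cfd_violations(addresses, {"CC": "44"}, ["zip"], ["street"]))
# [(('EH4 1DT',), ('Mayfield',), ('Crichton',))]
```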
Conditional inclusion dependencies (CINDs): same idea as CFDs, applied to INDs, for improving the quality of data
Ex: item[asin, title, price, type = book] ⊆ book[asin, title, price]
Matching dependency theory for data consistency
Record matching / object identification: to identify tuples from one or more relations that refer to the same real-world object – data quality, data integration, payment card, fraud detection.
Reasons: real-life data is often dirty: errors in the data sources ; and data is often represented differently in different sources => comparing attributes via equality only does not work!
Matching dependency (MD): uses ≍, which is a similarity operator
MDs vs FDs: an MD is like an FD, except that equalities can be relaxed to similarities ; and it relates two relations
Ex: “If two entities (tuples) agree on their last name and address, and if their first names are similar, then the two tuples should be identified on related attributes”: ∀s, t (card(s) ∧ trans(t) ∧ s[LN] = t[LN] ∧ s[address] = t[post] ∧ s[FN] ≍ t[FN] → s[X] = t[Y])
N.B. X and Y are compatible attributes in card and trans, respectively.
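The premise of this MD can be sketched as a pairwise test. The similarity operator ≍ is not fixed by the notes; this sketch assumes a `difflib` similarity ratio with a threshold of 0.8, and the sample tuples are invented.

```python
from difflib import SequenceMatcher

def similar(a, b, threshold=0.8):
    # Assumed stand-in for the similarity operator; the threshold is arbitrary.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def md_matches(card, trans):
    """Return index pairs (i, j) such that card[i] and trans[j] satisfy
    LN = LN, address = post and FN similar to FN."""
    return [(i, j)
            for i, s in enumerate(card)
            for j, t in enumerate(trans)
            if s["LN"] == t["LN"]
            and s["address"] == t["post"]
            and similar(s["FN"], t["FN"])]

card = [{"FN": "Jonathan", "LN": "Smith", "address": "10 Oak St"}]
trans = [
    {"FN": "Jonathon", "LN": "Smith", "post": "10 Oak St"},
    {"FN": "Maria", "LN": "Smith", "post": "10 Oak St"},
]

print(md_matches(card, trans))   # [(0, 0)]
```

For each returned pair, the MD then requires the compatible attributes X and Y to be identified; choosing the common value is a separate decision.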
Dynamic semantics: matching tuples are obtained from an instance that satisfies the MDs.
Matching keys: a minimal set of attributes can be identified that allows two tuples to be matched
Recall:
- MDs have dynamic semantics: they actually change tuples by means of identification
- MDs allow for automated reasoning: to infer new MDs and identify matching keys
- MDs have a formalism that is similar to that of CFDs: integration of consistency and matching.
Current state: both theoretical and practical aspects of MDs have been investigated, but their interaction with other data quality aspects needs to be explored further.
Other kinds of dependencies
Currency dependencies (CDs): extend FDs by allowing a temporal partial order ≺ on each attribute: a ≺ b if b is more recent than a.
Editing rules (eRs): provide a dynamic semantics to CFDs and incorporate user interaction
Key algorithmic challenges
Automatic discovery of data quality rules → Automated methods for reasoning about data quality dependencies → Automatically check whether the data is dirty or clean → Automatically fix? (NP-complete)
How to fix errors? Dependencies indicate possible repairs, but these may be inaccurate or simply incorrect – certain fixes : 100% correct. The need for this is evident when repairing critical data (every update is guaranteed to fix an error : the repairing process does not introduce new errors).
Editing rules + certain attributes: tell which values to select in repairs ; and only chase when the premise of the dependency contains certified attributes only.
Interaction: repairing can help matching – matching can help repairing
In reality: customers are hesitant to see their data automatically cleaned (suggestions are welcome ; certain fixes that use reference data only are OK)
CM3 – Repairing with quality improving dependencies
QIDs and the repair problem
The ingredients: dependencies + repair model
- Ex: a quality dependency + a dirty database + a repair model (deletion, insertion, edition) + a cost function
- Ex ctn’d: result with two possible repairs, if only deletions are allowed
Repairs, what is it? A repair of a database D relative to a set Σ of data quality dependencies is a database D′ such that : 1. D′ satisfies all the constraints in Σ ; and 2. D and D′ maximise a certain quality metric on databases
Different approaches to data repairing: repair is not unique ; two different ways of dealing with (multiple) repairs and queries over them:
Consistent query answering: avoid selecting a repair ; and at query time only return query answers that are common to all repairs (challenge: how to compute certain answers without computing all repairs?)
Data repairing: select the best possible repair, which is subsequently queried (challenge: how to compute a best repair?)
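To make the notion of certain answers concrete, here is a brute-force sketch for the special case of one key constraint under S-repairs (deletions only): each repair keeps exactly one tuple per key value, and a certain answer is one returned on every repair. Enumerating all repairs is exponential, which is exactly the cost consistent query answering tries to avoid. All names and data are hypothetical.

```python
from collections import defaultdict
from itertools import product

def s_repairs(rows, key_index):
    """Enumerate the subset repairs of `rows` for the constraint
    'column key_index is a key': keep one tuple per key value."""
    groups = defaultdict(list)
    for row in sorted(set(rows)):
        groups[row[key_index]].append(row)
    for choice in product(*groups.values()):
        yield list(choice)

def certain_answers(rows, key_index, query):
    """Answers of `query` that are common to all repairs."""
    return set.intersection(
        *(set(query(repair)) for repair in s_repairs(rows, key_index)))

# (NI#, city): the two n1 tuples violate the key on NI#.
people = [("n1", "Edinburgh"), ("n1", "London"), ("n2", "Paris")]

def cities(rel):
    return {city for _, city in rel}

print(len(list(s_repairs(people, 0))))      # 2
print(certain_answers(people, 0, cities))   # {'Paris'}
```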
Specification of data quality rules: formalisms should be expressive enough to specify data quality rules ; and simple enough that reasoning over them is (rather) efficient
How are data quality rules specified? Using a logical formalism.
Data quality dependencies: FDs (signature equalities only) – CFDs (signature equalities and equality with constants) – MDs (signature equalities and similarity relations)
Data quality dependency specification language: QID (Quality improving dependency)
Repair models: tell which modifications are allowed and what cost function is optimized
Subset-repair (S-repair): tuple deletions
Symmetric-difference-repair (Δ-repair) : tuple insertions and deletions
Value-modification repair (V-repair): tuple deletions, insertions and attribute-value modifications.
Repairing by chasing
Finding repairs, why the chase? Chase inputs : TGDs and EGDs + database D, possibly with unknown values ; if the chase terminates and is successful, then it outputs an instance D′ such that D′ ⊨ Σ.
Chase solves the problem of data repairing, at least for EGDs and TGDs, without any cost function. But the standard chase needs revisiting.
Standard chase of QIDs: constants overwrite labeled nulls, but the chase can fail (two different constants) ; when also considering constant QIDs (with constant equality in the conclusion) → chase not defined ⇒ we need to modify the chase for finding repairs.
The need for revisiting the chase: 1. Choose between different constants when QIDs are fired, based on some additional information ; 2. Overwrite values based on constants in the conclusion of QIDs ; and 3. If no information is available, replace different constants by a special value !!! Should happen locally and not over the entire table !!!
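One possible reading of such a revised chase step for a single FD is sketched below. The notes do not say what the additional information is; this sketch assumes a strict majority vote inside each group of tuples that agree on the premise, and falls back to a special value on ties. The names `repair_fd` and `SPECIAL` are hypothetical.

```python
from collections import Counter, defaultdict

SPECIAL = "⊥"   # hypothetical marker for "no clear choice"

def repair_fd(rows, lhs, rhs):
    """Resolve violations of lhs -> rhs group by group (locally)."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[a] for a in lhs)].append(row[rhs])
    chosen = {}
    for key, values in groups.items():
        ranked = Counter(values).most_common()
        if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
            chosen[key] = ranked[0][0]     # unique or strictly most frequent value
        else:
            chosen[key] = SPECIAL          # tie: no information to decide
    # Only the rhs cell of each row changes; other groups are never touched.
    return [{**row, rhs: chosen[tuple(row[a] for a in lhs)]} for row in rows]

table = [
    {"zip": "z1", "street": "A"},
    {"zip": "z1", "street": "A"},
    {"zip": "z1", "street": "B"},   # outvoted, becomes A
    {"zip": "z2", "street": "C"},
    {"zip": "z2", "street": "D"},   # tie with C, both become SPECIAL
]

print(repair_fd(table, ["zip"], "street"))
```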
Repairing with QIDs
Resolving FD-violations: V-repair cost function to choose between values when chasing + local changes + result may contain special symbols if no clear choice can be made
Resolving CFDs: even the revised chase does not always lead to a repair
Resolving MD-violations: similar to how FD-violations are resolved but the chase takes into account the similarity relations when firing a QID.
Repairing in the presence of master data
Master data: reference data that is trusted and clean
Certified attributes: attributes which are assured to be correct
Chasing with master data and certified attributes: values of master data always preferred ; QIDs fired only when attributes in the premise are certified.
Provides a uniform way of repairing data for QIDs ; by selecting certified attributes carefully, one can impose that only a unique repair is obtained: a certain fix.
Confidence-based repairing: annotate attribute values with confidence values – during the chase, these confidence values get propagated – a QID is fired only when the confidence of values does not decrease. ⇒ each chase step improves the quality of the data
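A rough sketch of one confidence-aware chase step for an FD, under simplifying assumptions: every cell is a (value, confidence) pair, only the confidence of the conclusion attribute is considered, and within each group the most confident value overwrites the others together with its confidence. The function name and the sample confidences are invented.

```python
def confidence_chase_step(rows, lhs, rhs):
    """rows: list of dicts attribute -> (value, confidence).
    Within each group agreeing on lhs, propagate the most confident rhs cell."""
    def key_of(row):
        return tuple(row[a][0] for a in lhs)

    best = {}
    for row in rows:
        key = key_of(row)
        if key not in best or row[rhs][1] > best[key][1]:
            best[key] = row[rhs]
    # Every rhs cell ends with a confidence >= its previous confidence.
    return [{**row, rhs: best[key_of(row)]} for row in rows]

cells = [
    {"zip": ("EH4", 1.0), "street": ("Mayfield", 0.9)},
    {"zip": ("EH4", 1.0), "street": ("Mayfeild", 0.4)},
]

for row in confidence_chase_step(cells, ["zip"], "street"):
    print(row["street"])        # ('Mayfield', 0.9) twice
```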
CM4 – Querying Graphs
Graph queries
Query language capabilities: subgraph matching + finding nodes connected by paths (+ approximate matching, aggregation, comparing paths)
Subgraph matching: conjunctive queries on graphs
- An edge pattern: a triple (s, ℓ, t) with s and t constants or variables, and ℓ ∈ ℒ is an edge label
- A query rule: a pattern head ← body where head and body are sets of edge patterns such that every variable occurring in head occurs in body
- A query: finite set of rules (of the same arity)
Ex: People and the doctors of their friends
Q1 = (?p, friendDoctor, ?d) ← (?p, knows, ?f), (?f, patientOf, ?d)
Ex: Patients and their friends
Q2 = ⟨?p, ?f⟩ ← (?p, knows, ?f), (?p, patientOf, ?d)
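Evaluating a single query rule can be sketched as a backtracking search for bindings of the body variables (homomorphism semantics). The graph below is a made-up toy instance, and the head is given as a tuple of variables, so for Q1 the sketch returns the (?p, ?d) pairs rather than building friendDoctor edges.

```python
def is_var(term):
    return term.startswith("?")

def evaluate(head_vars, body, graph):
    """graph: set of (source, label, target) edges.
    body: list of edge patterns (s, label, t); head_vars: tuple of variables."""
    def bind(binding, term, node):
        if is_var(term):
            return binding.setdefault(term, node) == node
        return term == node                 # constants must match exactly

    def match(patterns, binding):
        if not patterns:
            yield binding
            return
        (s, label, t), rest = patterns[0], patterns[1:]
        for u, edge_label, v in graph:
            if edge_label != label:
                continue
            extended = dict(binding)
            if all(bind(extended, term, node) for term, node in ((s, u), (t, v))):
                yield from match(rest, extended)

    return {tuple(b[v] for v in head_vars) for b in match(body, {})}

# Hypothetical toy graph.
G = {
    ("kotaro", "knows", "saori"),
    ("saori", "patientOf", "doc1"),
    ("kotaro", "patientOf", "doc2"),
}

q1 = evaluate(("?p", "?d"), [("?p", "knows", "?f"), ("?f", "patientOf", "?d")], G)
q2 = evaluate(("?p", "?f"), [("?p", "knows", "?f"), ("?p", "patientOf", "?d")], G)
print(q1)   # {('kotaro', 'doc1')}
print(q2)   # {('kotaro', 'saori')}
```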
Ex: Patients and their friends
Q3 = ⟨?p, ?f⟩ ← (?p, knows, ?f), (?p, knows, ?f), (?p, patientOf, ?d)
Q(G) = {⟨kotaro, saori⟩, ⟨kotaro, sriram⟩, ...} (homomorphisms)
Q(G) = {⟨kotaro, saori⟩, ⟨kotaro, sriram⟩, ...} (isomorphisms)
Path navigation: reachability – simplest form of path matching
G* = {(s, t) | there is a path in G from s to t}
Path navigation: label-constrained reachability – generalizing reachability
G_L* = {(s, t) | there is a path in G from s to t using only edges with labels in L}
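Both reachability relations can be computed with a breadth-first search from every node. This sketch assumes that the empty path counts, so every node reaches itself; the notes do not settle that point. The function name and the toy graph are hypothetical.

```python
from collections import deque

def reachable_pairs(graph, labels=None):
    """All (s, t) such that t is reachable from s, using any edge when
    labels is None, or only edges whose label is in `labels` otherwise."""
    successors = {}
    nodes = set()
    for s, label, t in graph:
        nodes.update((s, t))
        if labels is None or label in labels:
            successors.setdefault(s, set()).add(t)

    pairs = set()
    for start in nodes:
        seen, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nxt in successors.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        pairs.update((start, node) for node in seen)
    return pairs

G = {("a", "knows", "b"), ("b", "patientOf", "c")}

print(("a", "c") in reachable_pairs(G))               # True
print(("a", "c") in reachable_pairs(G, {"knows"}))    # False
```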
Path navigation: regular path queries (RPQs) – queries of the form ⟨?x, ?y⟩ ← (?x, r, ?y) where r is a regular expression over ℒ
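A common way to evaluate an RPQ is to search the product of the graph with an automaton for r. The sketch below assumes the regular expression has already been turned into an NFA by hand, encoded as a dict from (state, label) to a set of states; the example expression knows* · patientOf and the toy graph are made up.

```python
from collections import deque

def eval_rpq(graph, nfa, start_state, accepting):
    """Return all (x, y) such that some path from x to y spells a word
    accepted by the NFA. nfa: dict (state, label) -> set of next states."""
    nodes = {n for s, _, t in graph for n in (s, t)}
    answers = set()
    for x in nodes:
        seen = {(x, start_state)}
        queue = deque(seen)
        while queue:
            node, state = queue.popleft()
            if state in accepting:
                answers.add((x, node))
            for s, label, t in graph:
                if s != node:
                    continue
                for nxt in nfa.get((state, label), ()):
                    if (t, nxt) not in seen:
                        seen.add((t, nxt))
                        queue.append((t, nxt))
    return answers

# NFA for knows* . patientOf : state 0 loops on knows, patientOf leads to state 1.
nfa = {(0, "knows"): {0}, (0, "patientOf"): {1}}
G = {("a", "knows", "b"), ("b", "knows", "c"), ("c", "patientOf", "d")}

print(sorted(eval_rpq(G, nfa, 0, {1})))
# [('a', 'd'), ('b', 'd'), ('c', 'd')]
```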
Unions of conjunctions of RPQs: Ex: Doctors and the patients in both their social and treatment networks
Q = ⟨?d, ?p⟩ ← (?p, knows*, ?f), (?p, patientOf*, ?f)
