
CHAPTER 14: DATA PROVENANCE

PRINCIPLES OF DATA INTEGRATION
ANHAI DOAN, ALON HALEVY, ZACHARY IVES
“Where Did this Data Come from?”
Challenge: integrated data may come from many sources and mappings – of different quality or trustworthiness!
• How did I get this particular result?
• What mappings produced it?
• How much should I trust (believe) it?

Data provenance (lineage) captures the relationships between tuples in a set of data instances
An Example: View Tuple Derivations
Source relations

R  A  B        S  B  C
   1  2           2  3
   1  4           3  2
                  4  3

View V1 = R ⋈ S ∪ S ⋈ S

V1  A  C   directly derivable by
    1  3   R(1,2) ⋈ S(2,3) ∪ R(1,4) ⋈ S(4,3)
    2  2   S(2,3) ⋈ ρB→A, C→B S(3,2)
    3  3   S(3,2) ⋈ ρB→A, C→B S(2,3)
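
The derivations listed above can be reproduced mechanically. The following is a minimal Python sketch, not the authors' implementation: the function name `view_with_derivations` and the tuple encoding are assumptions made for illustration. It evaluates both branches of V1 over the example relations and records, for every view tuple, the source tuples that were joined to produce it.

```python
from collections import defaultdict

R = [(1, 2), (1, 4)]          # R(A, B)
S = [(2, 3), (3, 2), (4, 3)]  # S(B, C)

def view_with_derivations(r_rows, s_rows):
    """Map each V1 tuple to the list of source-tuple pairs that derive it."""
    derivations = defaultdict(list)
    # Branch 1 (R joined with S): match R.B with S.B, keep (A, C).
    for a, b in r_rows:
        for b2, c in s_rows:
            if b == b2:
                derivations[(a, c)].append((("R", (a, b)), ("S", (b2, c))))
    # Branch 2 (S joined with S): pair S(x, y) with S(y, x), keep (x, x).
    for x, y in s_rows:
        if (y, x) in s_rows:
            derivations[(x, x)].append((("S", (x, y)), ("S", (y, x))))
    return dict(derivations)

derivs = view_with_derivations(R, S)
for view_tuple in sorted(derivs):
    print(view_tuple, "<-", derivs[view_tuple])
```

Tuple (1,3) ends up with two derivations and the other two view tuples with one each, matching the table.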
Formulating a Provenance Model
Conceptually, provenance captures the operations and operands going into a result
There are many options to do this, and many levels of detail!

A “good” provenance model should:
• Have a formal semantics
• Have equivalence properties such that equivalent query plans produce equivalent provenance
• Connect to notions of value, quality or score
Outline
• The two views of provenance
• Applications of data provenance
• Provenance semirings: one ring to rule them all
• Storing provenance
Provenance as Annotations on Data
• Annotate each derivation with an “explanation” in terms of relational algebra and the tuple operands
• Lets us “look up” the derivation of a result

View V1 (in Datalog):
V1(x,z) :- R(x,y), S(y,z)
V1(x,x) :- S(x,y), S(y,x)

R  A  B        S  B  C
   1  2           2  3
   1  4           3  2
                  4  3

V1  A  C   provenance annotation
    1  3   R(1,2) ⋈ S(2,3) ∪ R(1,4) ⋈ S(4,3)
    2  2   S(2,3) ⋈ ρB→A, C→B S(3,2)
    3  3   S(3,2) ⋈ ρB→A, C→B S(2,3)
Provenance as a Graph of Relationships
• Bipartite graph: tuple nodes connected via “derivation nodes”
• Encodes a hypergraph (hyperedges = derivations)
• Makes direct derivation relationships more explicit

[Figure: provenance graph; each “derives via V1” node links its source tuples to one view tuple]
R(1,4), S(4,3)  → derives via V1 →  V1(1,3)
R(1,2), S(2,3)  → derives via V1 →  V1(1,3)
S(2,3), S(3,2)  → derives via V1 →  V1(2,2)
S(3,2), S(2,3)  → derives via V1 →  V1(3,3)
Making the Two Interchangeable
• We can make these equivalent by introducing provenance tokens (equiv. node IDs) for each tuple
• Derived tuples’ annotations = expressions over tokens

R  A  B  ann       S  B  C  ann
   1  2  r1           2  3  s1
   1  4  r2           3  2  s2
                      4  3  s3

V1  A  C  ann
    1  3  v1 = r1 ⋈ s1 ∪ r2 ⋈ s3
    2  2  v2 = s1 ⋈ s2
    3  3  v3 = s2 ⋈ s1

[Figure: the same provenance graph with nodes labeled by tokens; V1 derivation nodes link r1, s1 and r2, s3 to v1, and s1, s2 to v2 and to v3]
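
To make the equivalence concrete, the following is a small Python sketch (an illustration only; the names `derivation_edges`, `annotations_from_graph`, and `graph_from_annotations` are hypothetical). It stores the graph as a list of hyperedges over tokens, folds them into annotation expressions, and then unfolds the expressions back into the same hyperedges.

```python
from collections import defaultdict

# Hyperedges of the provenance graph: (source tokens, derived token).
derivation_edges = [
    (("r1", "s1"), "v1"),
    (("r2", "s3"), "v1"),
    (("s1", "s2"), "v2"),
    (("s2", "s1"), "v3"),
]

def annotations_from_graph(edges):
    """Graph -> annotations: join the sources of an edge, union alternative edges."""
    terms_by_target = defaultdict(list)
    for sources, target in edges:
        terms_by_target[target].append(" ⋈ ".join(sources))
    return {target: " ∪ ".join(terms) for target, terms in terms_by_target.items()}

def graph_from_annotations(annotations):
    """Annotations -> graph: every union branch becomes one hyperedge."""
    edges = []
    for target, expression in annotations.items():
        for term in expression.split(" ∪ "):
            edges.append((tuple(term.split(" ⋈ ")), target))
    return edges

annotations = annotations_from_graph(derivation_edges)
print(annotations["v1"])
```

The round trip only works this simply because every annotation here is a union of joins over base tokens; nested expressions would need a real parser.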
Where Can We Use Provenance?
Explanations
• Help the user understand why an item exists

Scoring
• Provide a ranked list of “most relevant” results

Reasoning about interactions
• Help the user understand data relationships
Examples of Provenance’s Utility
Schema mapping debugging:
We may have a bad result
→ Determine why that result exists, what is faulty
Bioinformatics data integration:
Different sources have different levels of reliability or authoritativeness
→ Rank results by score!
Probabilistic databases:
We may need to know that results are correlated
→ Encode the relationships, use to assign probabilities
The Notion of Provenance as Annotations
• Many formalisms were defined for using query computations to produce annotations
• Each captured certain subtleties
• The key question: Is there one “most powerful” model that captures the properties of the relational algebra*?
• Equivalent queries should produce equivalent provenance

* over multi-sets or bags, as used by “real” systems


The Provenance Semiring Model
To represent provenance, use:
• A set of provenance tokens or tuple IDs, K
• Abstract operators representing combination of tuples

Abstract sum operator, ⊕, for union or projection
• has identity element 0 (a ⊕ 0 ≡ 0 ⊕ a ≡ a)

Abstract product operator, ⊗, for join
• has identity element 1 (a ⊗ 1 ≡ 1 ⊗ a ≡ a)
• also (a ⊗ 0 ≡ 0 ⊗ a ≡ 0)

This is formally a commutative semiring
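
The laws above can be spot-checked on concrete instances. The following is a minimal Python sketch, assuming a hypothetical `Semiring` record and a brute-force `satisfies_laws` helper (neither comes from the book). It tests the identity, annihilation, commutativity, associativity, and distributivity laws over a small sample of values.

```python
from dataclasses import dataclass
from itertools import product
from typing import Any, Callable, Iterable

@dataclass(frozen=True)
class Semiring:
    zero: Any                         # identity of plus, annihilator of times
    one: Any                          # identity of times
    plus: Callable[[Any, Any], Any]   # abstract sum (union / projection)
    times: Callable[[Any, Any], Any]  # abstract product (join)

def satisfies_laws(k: Semiring, samples: Iterable[Any]) -> bool:
    """Check the commutative-semiring laws on every combination of samples."""
    xs = list(samples)
    for a in xs:
        if k.plus(a, k.zero) != a or k.plus(k.zero, a) != a:
            return False
        if k.times(a, k.one) != a or k.times(k.one, a) != a:
            return False
        if k.times(a, k.zero) != k.zero or k.times(k.zero, a) != k.zero:
            return False
    for a, b, c in product(xs, repeat=3):
        if k.plus(a, b) != k.plus(b, a) or k.times(a, b) != k.times(b, a):
            return False
        if k.plus(k.plus(a, b), c) != k.plus(a, k.plus(b, c)):
            return False
        if k.times(k.times(a, b), c) != k.times(a, k.times(b, c)):
            return False
        if k.times(a, k.plus(b, c)) != k.plus(k.times(a, b), k.times(a, c)):
            return False
    return True

NATURALS = Semiring(0, 1, lambda a, b: a + b, lambda a, b: a * b)
BOOLEANS = Semiring(False, True, lambda a, b: a or b, lambda a, b: a and b)
NOT_A_SEMIRING = Semiring(0, 1, lambda a, b: a - b, lambda a, b: a * b)

print(satisfies_laws(NATURALS, range(4)))
print(satisfies_laws(BOOLEANS, [False, True]))
print(satisfies_laws(NOT_A_SEMIRING, range(4)))
```

A check over samples can only refute the laws, not prove them, but it catches slips such as writing a ⊕ 0 ≡ 0.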

The Provenance Semiring Model
• We can re-express our example as below, using the semiring operators instead of the relational algebra ones

R  A  B  ann       S  B  C  ann
   1  2  r1           2  3  s1
   1  4  r2           3  2  s2
                      4  3  s3

V1  A  C  ann
    1  3  v1 = r1 ⊗ s1 ⊕ r2 ⊗ s3
    2  2  v2 = s1 ⊗ s2
    3  3  v3 = s2 ⊗ s1

[Figure: the token-labeled provenance graph; V1 derivation nodes link r1, s1 and r2, s3 to v1, and s1, s2 to v2 and to v3]
Tokens for Mappings
• Sometimes we would like to assign a token to the actual mapping or rule used – so we can assign it a value

View V1 (in Datalog):
V1(x,z) :- R(x,y), S(y,z)     Call this m1
V1(x,x) :- S(x,y), S(y,x)     Call this m2

R  A  B  ann       S  B  C  ann
   1  2  r1           2  3  s1
   1  4  r2           3  2  s2
                      4  3  s3

V1  A  C  ann
    1  3  v1 = m1 ⊗ [r1 ⊗ s1] ⊕ m1 ⊗ [r2 ⊗ s3]
    2  2  v2 = m2 ⊗ [s1 ⊗ s2]
    3  3  v3 = m2 ⊗ [s2 ⊗ s1]
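
One use of mapping tokens is to switch a whole rule on or off. The following is a hedged Python sketch: the sum-of-products encoding and the `is_trusted` helper are assumptions for illustration. Each derivation is tagged with the rule that produced it (both derivations of V1(1,3) come from the first rule, m1), and the annotations are evaluated with Boolean trust values in which mapping m2 is distrusted.

```python
# Each annotation is a sum (outer list) of products (inner lists) of tokens.
ANNOTATIONS = {
    "v1": [["m1", "r1", "s1"], ["m1", "r2", "s3"]],
    "v2": [["m2", "s1", "s2"]],
    "v3": [["m2", "s2", "s1"]],
}

def is_trusted(annotation, trust):
    """Boolean evaluation: the product is AND, the sum is OR."""
    return any(all(trust[token] for token in term) for term in annotation)

# Hypothetical trust assignment: every base tuple and m1 trusted, m2 not.
trust = {token: True for token in ["m1", "m2", "r1", "r2", "s1", "s2", "s3"]}
trust["m2"] = False

result = {name: is_trusted(ann, trust) for name, ann in ANNOTATIONS.items()}
print(result)
```

Only v1 survives, because every derivation of v2 and v3 passes through the distrusted mapping.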
Example Application: Provenance Visualization
[Figure: screenshot of a provenance graph visualization. Callouts: tuple nodes; base tuple derivation (token not shown); derivation by mapping M5]
Example Application: Tuple Scoring
• For ranked query results, we may adopt the following model commonly used in ranking:
• Assign a score to each base tuple = -log2(probability)
• Use arithmetic sum as ⊗
• Use min as ⊕
• Suppose prob(r1) = 0.5, prob(s1) = 0.5, others are 1.0
• Then score(r1) = score(s1) = 1 and score(r2) = score(s2) = score(s3) = 0; a lower score means a more probable result

V1  A  C  ann
    1  3  v1 = r1 ⊗ s1 ⊕ r2 ⊗ s3 = min((1+1),(0+0)) = 0
    2  2  v2 = s1 ⊗ s2 = 1+0 = 1
    3  3  v3 = s2 ⊗ s1 = 0+1 = 1
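
The scoring model can be worked through in a few lines. The following is a minimal Python sketch under the stated probabilities (the dictionary layout and the function name `score_of` are illustrative assumptions). With score = -log2(probability), r1 and s1 get score 1 and the remaining base tuples get score 0; ⊗ is evaluated as addition and ⊕ as min.

```python
import math

prob = {"r1": 0.5, "s1": 0.5, "r2": 1.0, "s2": 1.0, "s3": 1.0}
# log2(1/p) equals -log2(p); written this way to avoid printing -0.0.
score = {token: math.log2(1 / p) for token, p in prob.items()}

# Sum-of-products form of the annotations of V1.
ANNOTATIONS = {
    "v1": [["r1", "s1"], ["r2", "s3"]],
    "v2": [["s1", "s2"]],
    "v3": [["s2", "s1"]],
}

def score_of(annotation):
    """Product = arithmetic sum of scores, sum = min over alternative derivations."""
    return min(sum(score[token] for token in term) for term in annotation)

scores = {name: score_of(ann) for name, ann in ANNOTATIONS.items()}
ranking = sorted(scores, key=lambda name: (scores[name], name))
print(scores)
print(ranking)
```

Because the score is a negative log-probability, the smallest value ranks first: v1 wins through its second derivation, which uses only certain tuples.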
Useful Semirings
Use case                 Base value                    Product R ⊗ S       Sum R ⊕ S
Derivability             True                          R ∧ S               R ∨ S
Trust                    Trust condition result        R ∧ S               R ∨ S
Confidentiality level    Tuple confidentiality level   More_secure(R,S)    Less_secure(R,S)
Weight / cost            Base tuple weight             R + S               min(R,S)
Lineage                  Tuple ID                      R ∪ S               R ∪ S
Probabilistic event      Tuple probabilistic event     R ∧ S               R ∨ S
Number of derivations    1                             R ⋅ S               R + S
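
The point of the table is that one stored annotation can be evaluated under any of these semirings. The following is a hedged Python sketch (the `evaluate` helper and the base values are made up for illustration) that evaluates the annotations of V1 under three rows of the table: derivability, number of derivations, and weight / cost.

```python
import operator
from functools import reduce

# Sum-of-products form of the annotations of V1.
ANNOTATIONS = {
    "v1": [["r1", "s1"], ["r2", "s3"]],
    "v2": [["s1", "s2"]],
    "v3": [["s2", "s1"]],
}
TOKENS = ["r1", "r2", "s1", "s2", "s3"]

def evaluate(annotation, base, plus, times):
    """Replace tokens by base values, then fold with the semiring operators."""
    terms = [reduce(times, (base[token] for token in term)) for term in annotation]
    return reduce(plus, terms)

def evaluate_all(base, plus, times):
    return {name: evaluate(ann, base, plus, times) for name, ann in ANNOTATIONS.items()}

# Derivability: suppose (hypothetically) that s1 has been deleted.
present = {token: token != "s1" for token in TOKENS}
derivable = evaluate_all(present, operator.or_, operator.and_)

# Number of derivations: every base tuple counts as 1.
counts = evaluate_all({token: 1 for token in TOKENS}, operator.add, operator.mul)

# Weight / cost: hypothetical base weights, product is +, sum is min.
weights = {"r1": 4, "r2": 1, "s1": 2, "s2": 3, "s3": 5}
costs = evaluate_all(weights, min, operator.add)

print(derivable, counts, costs)
```

Only the base values and the two operators change between the three evaluations; the annotations themselves are untouched.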
Storing Provenance
• Use tuple keys as tokens
• Encode provenance graph as relations

View V1 (in Datalog):
V1(x,z) :- R(x,y), S(y,z)     Relate tuples with table Pv1-1
V1(x,x) :- S(x,y), S(y,x)     Relate tuples with table Pv1-2

R  A  B     S  B  C     V1  A  C
   1  2        2  3         1  3
   1  4        3  2         2  2
               4  3         3  3

Pv1-1  R.A  R.B  S.B  S.C  V1.A  V1.C
       1    2    2    3    1     3
       1    4    4    3    1     3

Pv1-2  S.B  S.C  S.B’  S.C’  V1.A  V1.C
       2    3    3     2     2     2
       3    2    2     3     3     3
Storing Provenance
• Use tuple keys as tokens
• Encode provenance graph as relations

View V1 (in Datalog):
V1(x,z) :- R(x,y), S(y,z)
V1(x,x) :- S(x,y), S(y,x)

Pv1-1  R.A  R.B  S.B  S.C  V1.A  V1.C
       1    2    2    3    1     3
       1    4    4    3    1     3

Pv1-2  S.B  S.C  S.B’  S.C’  V1.A  V1.C
       2    3    3     2     2     2
       3    2    2     3     3     3

Some of these columns are redundant if we know the Datalog: columns bound to the same rule variable always hold equal values (e.g. R.B = S.B, V1.A = R.A and V1.C = S.C in Pv1-1)
Storing Provenance
• Use tuple keys as tokens
• Encode provenance graph as relations

View V1 (in Datalog):
V1(x,z) :- R(x,y), S(y,z)
V1(x,x) :- S(x,y), S(y,x)

R  A  B     S  B  C     V1  A  C
   1  2        2  3         1  3
   1  4        3  2         2  2
               4  3         3  3

Pv1-1  A  B  C
       1  2  3
       1  4  3

Pv1-2  B  C  C’
       2  3  2
       3  2  3
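
The compact encoding can be tried out with an in-memory SQLite database. The following is a sketch only: the table names `Pv1_1` and `Pv1_2`, the column name `C_prime` (standing in for C’), and the helper `derivations_of` are assumptions made for illustration. One provenance relation is populated per rule, and the source tuples behind a given V1 tuple are then looked up from them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE R (A INTEGER, B INTEGER);
CREATE TABLE S (B INTEGER, C INTEGER);
INSERT INTO R VALUES (1, 2), (1, 4);
INSERT INTO S VALUES (2, 3), (3, 2), (4, 3);

-- Rule 1: V1(x,z) :- R(x,y), S(y,z). One row (x, y, z) per derivation.
CREATE TABLE Pv1_1 AS
  SELECT R.A AS A, R.B AS B, S.C AS C
  FROM R JOIN S ON R.B = S.B;

-- Rule 2: V1(x,x) :- S(x,y), S(y,x). One row per derivation.
CREATE TABLE Pv1_2 AS
  SELECT S1.B AS B, S1.C AS C, S2.C AS C_prime
  FROM S AS S1 JOIN S AS S2 ON S1.C = S2.B AND S2.C = S1.B;
""")

def derivations_of(conn, a, c):
    """Return the source-tuple pairs that derive V1(a, c)."""
    found = []
    rule1 = "SELECT A, B, C FROM Pv1_1 WHERE A = ? AND C = ? ORDER BY B"
    for ra, rb, sc in conn.execute(rule1, (a, c)):
        found.append((("R", (ra, rb)), ("S", (rb, sc))))
    rule2 = "SELECT B, C, C_prime FROM Pv1_2 WHERE B = ? AND C_prime = ? ORDER BY C"
    for b, c1, c2 in conn.execute(rule2, (a, c)):
        found.append((("S", (b, c1)), ("S", (c1, c2))))
    return found

print(derivations_of(conn, 1, 3))
print(derivations_of(conn, 2, 2))
```

The lookup rebuilds S.B from the stored B column of `Pv1_1`, which is why the compact tables can drop the columns that the Datalog rule already determines.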
Data Provenance Wrap-up
• Provenance is critical to understanding and assessing the believability of data, and in debugging
• Two equivalent representations – annotations vs graph
• Provenance semiring model preserves the “expected” equivalences of the relational algebra
• We can take semiring provenance and evaluate it with different semirings to get useful scores
• We can store provenance using relations
• Recent work beyond the scope of the book:
  • Extending provenance to more complex queries, e.g., with aggregation
  • Languages for querying provenance (primarily as a graph)
