Chapter 14: Data Provenance
PROVENANCE
PRINCIPLES OF DATA INTEGRATION
ANHAI DOAN, ALON HALEVY, ZACHARY IVES
“Where Did this Data Come from?”
Challenge: integrated data may come from many sources and mappings – of different quality or trustworthiness!
How did I get this particular result?
What mappings produced it?
How much should I trust (believe) it?
An Example: View Tuple Derivations
Source relations
R   A  B        S   B  C
    1  2            2  3
    1  4            3  2
                    4  3

View V1 = (R ⋈ S) ∪ (S ⋈ S)

V1  A  C   directly derivable by
    1  3   R(1,2) ⋈ S(2,3) ∪ R(1,4) ⋈ S(4,3)
    2  2   S(2,3) ⋈ ρ_{B→A, C→B} S(3,2)
    3  3   S(3,2) ⋈ ρ_{B→A, C→B} S(2,3)
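The derivations listed for V1 can be reproduced mechanically. The sketch below is a minimal illustration, assuming the view is defined by the two Datalog rules given later in the deck (V1(x,z) :- R(x,y), S(y,z) and V1(x,x) :- S(x,y), S(y,x)); the function name `derivations` and the tuple encoding are hypothetical choices, not part of the slides.

```python
from collections import defaultdict

# Source relations from the slide, as sets of tuples.
R = {(1, 2), (1, 4)}            # R(A, B)
S = {(2, 3), (3, 2), (4, 3)}    # S(B, C)

def derivations(R, S):
    """Map each V1 tuple to the list of source-tuple pairs that derive it."""
    out = defaultdict(list)
    # Rule 1: V1(x, z) :- R(x, y), S(y, z)
    for (x, y) in sorted(R):
        for (y2, z) in sorted(S):
            if y == y2:
                out[(x, z)].append((("R", (x, y)), ("S", (y2, z))))
    # Rule 2: V1(x, x) :- S(x, y), S(y, x)
    for (x, y) in sorted(S):
        for (y2, x2) in sorted(S):
            if y == y2 and x == x2:
                out[(x, x)].append((("S", (x, y)), ("S", (y2, x2))))
    return dict(out)

V1 = derivations(R, S)
for tup in sorted(V1):
    print(tup, V1[tup])
```

Each key is a view tuple and each list entry is one alternative derivation, so the tuple (1,3) carries two entries while (2,2) and (3,3) carry one each.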
Formulating a Provenance Model
Conceptually, provenance captures the operations and operands going into a result
There are many options to do this, and many levels of detail!
Outline
The two views of provenance
Applications of data provenance
Provenance semirings: one ring to rule them all
Storing provenance
Provenance as Annotations on Data
S   B  C
    2  3
    3  2
    4  3

V1  A  C   provenance annotation
    1  3   R(1,2) ⋈ S(2,3) ∪ R(1,4) ⋈ S(4,3)
    2  2   S(2,3) ⋈ ρ_{B→A, C→B} S(3,2)
    3  3   S(3,2) ⋈ ρ_{B→A, C→B} S(2,3)
Provenance as a Graph of Relationships
[Figure: provenance graph. Tuple nodes are connected by "derives via V1" edges to the view tuples they produce, e.g., V1(3,3).]
Making the Two Interchangeable
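R   A  B  ann
    1  2  r1
    1  4  r2

S   B  C  ann
    2  3  s1
    3  2  s2
    4  3  s3

V1  A  C  ann
    1  3  v1 = r1 ⋈ s1 ∪ r2 ⋈ s3
    2  2  v2 = s1 ⋈ s2
    3  3  v3 = s2 ⋈ s1

[Figure: the same provenance as a graph. Derivation nodes labeled V1 connect source tuple nodes to view tuple nodes: r1, s1 → v1; r2, s3 → v1; s1, s2 → v2; s2, s1 → v3.]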
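One way to see that the two views carry the same information is to convert between them. The following sketch assumes a simple encoding that is not from the slides: annotations are stored as lists of alternative derivations, and each derivation becomes a graph node with a made-up identifier such as `V1#v1.0`.

```python
# Annotation view: each view token maps to its alternative derivations
# (a union); each derivation is the tuple of source tokens that were joined.
annotations = {
    "v1": [("r1", "s1"), ("r2", "s3")],
    "v2": [("s1", "s2")],
    "v3": [("s2", "s1")],
}

def to_graph(annotations, mapping="V1"):
    """Return (derivation_nodes, edges) for the graph view."""
    nodes, edges = [], []
    for target, alternatives in annotations.items():
        for i, sources in enumerate(alternatives):
            d = f"{mapping}#{target}.{i}"      # hypothetical derivation-node id
            nodes.append(d)
            edges += [(s, d) for s in sources]  # source tuple -> derivation
            edges.append((d, target))           # derivation -> view tuple
    return nodes, edges

def to_annotations(nodes, edges):
    """Rebuild the annotation view from the graph view."""
    node_set = set(nodes)
    inputs = {d: [] for d in nodes}
    target_of = {}
    for src, dst in edges:
        if dst in node_set:
            inputs[dst].append(src)
        elif src in node_set:
            target_of[src] = dst
    result = {}
    for d in nodes:
        result.setdefault(target_of[d], []).append(tuple(inputs[d]))
    return result

nodes, edges = to_graph(annotations)
assert to_annotations(nodes, edges) == annotations  # lossless round trip
```

The round trip succeeds because every derivation node records exactly one join (its incoming edges) and one result (its outgoing edge), which is the information an annotation term holds.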
Where Can We Use Provenance?
Explanations
Help the user understand why an item exists
Scoring
Provide a ranked list of “most relevant” results
The Notion of Provenance as Annotations
Many formalisms were defined for using query
computations to produce annotations
Each captured certain subtleties
The Provenance Semiring Model
R   A  B  ann
    1  2  r1
    1  4  r2

S   B  C  ann
    2  3  s1
    3  2  s2
    4  3  s3

V1  A  C  ann
    1  3  v1 = r1 ⊗ s1 ⊕ r2 ⊗ s3
    2  2  v2 = s1 ⊗ s2
    3  3  v3 = s2 ⊗ s1

[Figure: the corresponding provenance graph. Derivation nodes labeled V1 connect r1, s1 → v1; r2, s3 → v1; s1, s2 → v2; s2, s1 → v3.]
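The expressions for v1, v2, and v3 are polynomials over the base tokens, so they can be evaluated once a concrete ⊕, ⊗, and base values are chosen. The sketch below is one possible encoding, not the book's implementation: the `Semiring` class, the sum-of-products list format, and the choice of the counting semiring (natural numbers with + and ×, which does not appear on the slide) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Any, Callable

@dataclass(frozen=True)
class Semiring:
    zero: Any                           # identity for plus
    one: Any                            # identity for times
    plus: Callable[[Any, Any], Any]     # interprets ⊕ (alternative derivations)
    times: Callable[[Any, Any], Any]    # interprets ⊗ (joint use in a join)

# Provenance polynomials in sum-of-products form:
# the outer list is ⊕, each inner list is a ⊗ of base tokens.
PROVENANCE = {
    "v1": [["r1", "s1"], ["r2", "s3"]],
    "v2": [["s1", "s2"]],
    "v3": [["s2", "s1"]],
}

def evaluate(poly, semiring, assignment):
    """Evaluate one polynomial under a semiring and a token assignment."""
    products = (
        reduce(semiring.times, (assignment[t] for t in term), semiring.one)
        for term in poly
    )
    return reduce(semiring.plus, products, semiring.zero)

# Counting semiring: with every base tuple given multiplicity 1,
# the result is the number of derivations of each view tuple.
COUNTING = Semiring(0, 1, lambda a, b: a + b, lambda a, b: a * b)
multiplicity = {t: 1 for t in ("r1", "r2", "s1", "s2", "s3")}
counts = {v: evaluate(p, COUNTING, multiplicity) for v, p in PROVENANCE.items()}
print(counts)  # {'v1': 2, 'v2': 1, 'v3': 1}
```

Swapping in a different `Semiring` instance reuses the same stored polynomials, which is the point of keeping provenance symbolic.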
Tokens for Mappings
Here m1 and m2 are tokens for the two mappings that define V1 (m1 for R ⋈ S, m2 for S ⋈ S).

S   B  C  ann
    2  3  s1
    3  2  s2
    4  3  s3

V1  A  C  Ann
    1  3  v1 = m1 ⊗ [r1 ⊗ s1] ⊕ m1 ⊗ [r2 ⊗ s3]
    2  2  v2 = m2 ⊗ [s1 ⊗ s2]
    3  3  v3 = m2 ⊗ [s2 ⊗ s1]
Example Application: Provenance Visualization
[Figure: provenance graph visualization. Legend: tuple nodes; base tuple derivation (token not shown); derivation by mapping M5.]
Example Application: Tuple Scoring
For ranked query results, we may adopt the following model commonly used in ranking:
Assign a score to each base tuple = −log2(probability)
Use arithmetic sum as ⊗
Use min as ⊕
Suppose
prob(r1) = 0.5, prob(s1) = 0.5, others are 1.0
Then score(r1) = score(s1) = 1, and all other scores are 0

V1  A  C  Ann
    1  3  v1 = r1 ⊗ s1 ⊕ r2 ⊗ s3 = min((1+1), (0+0)) = 0
    2  2  v2 = s1 ⊗ s2 = 1+0 = 1
    3  3  v3 = s2 ⊗ s1 = 0+1 = 1
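The scoring rules can be checked by applying them directly to the stated probabilities: score = −log2(probability), sum for ⊗, min for ⊕. The sketch below is a minimal illustration; the helper name `tuple_score` and the list-of-derivations encoding are hypothetical.

```python
import math

# Probabilities as stated: r1 and s1 are 0.5, all others are 1.0.
prob = {"r1": 0.5, "s1": 0.5, "r2": 1.0, "s2": 1.0, "s3": 1.0}
score = {t: -math.log2(p) for t, p in prob.items()}

def tuple_score(derivations):
    """min over alternative derivations (⊕) of the sum of base scores (⊗)."""
    return min(sum(score[t] for t in d) for d in derivations)

v1 = tuple_score([("r1", "s1"), ("r2", "s3")])
v2 = tuple_score([("s1", "s2")])
v3 = tuple_score([("s2", "s1")])
print(v1, v2, v3)
```

Lower scores rank higher under this model: v1 is reachable through r2 and s3, both of probability 1.0, so its best derivation costs nothing, while v2 and v3 each depend on s1.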
Useful Semirings
Use case               Base value                   Product R ⊗ S      Sum R ⊕ S
Derivability           True                         R ∧ S              R ∨ S
Trust                  Trust condition result       R ∧ S              R ∨ S
Confidentiality level  Tuple confidentiality level  More_secure(R,S)   Less_secure(R,S)
Weight / cost          Base tuple weight            R + S              min(R,S)
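The derivability and trust rows both use ∧ for ⊗ and ∨ for ⊕, so a view tuple is trusted when at least one of its derivations uses only trusted base tuples. The sketch below illustrates this on the V1 annotations from the running example; the set of distrusted tuples is a made-up scenario and `derivable` is a hypothetical helper name.

```python
# Annotations of V1 from the running example, as lists of alternative derivations.
PROVENANCE = {
    "v1": [("r1", "s1"), ("r2", "s3")],
    "v2": [("s1", "s2")],
    "v3": [("s2", "s1")],
}

def derivable(derivations, trusted):
    """⊕ is OR over derivations, ⊗ is AND over the tokens in one derivation."""
    return any(all(t in trusted for t in d) for d in derivations)

# Hypothetical scenario 1: r1 is not trusted.
without_r1 = {"r2", "s1", "s2", "s3"}
result_1 = {v: derivable(d, without_r1) for v, d in PROVENANCE.items()}

# Hypothetical scenario 2: s1 is not trusted.
without_s1 = {"r1", "r2", "s2", "s3"}
result_2 = {v: derivable(d, without_s1) for v, d in PROVENANCE.items()}

print(result_1)
print(result_2)
```

In the first scenario every view tuple survives, since v1 has an alternative derivation through r2 and s3. In the second, v2 and v3 lose their only derivation while v1 still holds.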
Storing Provenance
Use tuple keys as tokens
Encode provenance graph as relations
View V1 (in Datalog):
V1(x,z) :- R(x,y), S(y,z)      (relate tuples with table Pv1-1)
V1(x,x) :- S(x,y), S(y,x)      (relate tuples with table Pv1-2)

R   A  B        S   B  C        V1  A  C
    1  2            2  3            1  3
    1  4            3  2            2  2
                    4  3            3  3

Pv1-1  R.A  R.B  S.B  S.C  V1.A  V1.C
       1    2    2    3    1     3
       1    4    4    3    1     3

Pv1-2  S.B  S.C  S.B’  S.C’  V1.A  V1.C
       2    3    3     2     2     2
       3    2    2     3     3     3
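Because the provenance tables are ordinary relations, they can be populated by the same joins that define the view. The sketch below uses Python's standard `sqlite3` module with an in-memory database; the SQL column names (`R_A`, `S_B2`, and so on) are stand-ins chosen here for the slide's R.A and S.B’ headers, and the schema is an assumption for illustration rather than the book's storage design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE R (A INTEGER, B INTEGER);
CREATE TABLE S (B INTEGER, C INTEGER);
INSERT INTO R VALUES (1, 2), (1, 4);
INSERT INTO S VALUES (2, 3), (3, 2), (4, 3);

-- Rule 1: V1(x,z) :- R(x,y), S(y,z)
CREATE TABLE Pv1_1 AS
  SELECT R.A AS R_A, R.B AS R_B, S.B AS S_B, S.C AS S_C,
         R.A AS V1_A, S.C AS V1_C
  FROM R JOIN S ON R.B = S.B;

-- Rule 2: V1(x,x) :- S(x,y), S(y,x)
CREATE TABLE Pv1_2 AS
  SELECT S1.B AS S_B, S1.C AS S_C, S2.B AS S_B2, S2.C AS S_C2,
         S1.B AS V1_A, S1.B AS V1_C
  FROM S AS S1 JOIN S AS S2 ON S1.C = S2.B AND S2.C = S1.B;
""")

pv1_1 = conn.execute("SELECT * FROM Pv1_1 ORDER BY R_B").fetchall()
pv1_2 = conn.execute("SELECT * FROM Pv1_2 ORDER BY S_B").fetchall()

# The view itself is the union of the V1 columns of both provenance tables.
v1 = conn.execute(
    "SELECT V1_A, V1_C FROM Pv1_1 UNION SELECT V1_A, V1_C FROM Pv1_2 ORDER BY 1, 2"
).fetchall()

# "Where did V1(1,3) come from?" becomes a selection on the provenance table.
why_1_3 = conn.execute(
    "SELECT R_A, R_B, S_B, S_C FROM Pv1_1 "
    "WHERE V1_A = ? AND V1_C = ? ORDER BY R_B",
    (1, 3),
).fetchall()
print(pv1_1, pv1_2, v1, why_1_3, sep="\n")
```

Each row of a provenance table corresponds to one derivation node in the graph view, so explanation queries reduce to plain relational selections.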
Storing Provenance
Several columns of Pv1-1 and Pv1-2 are redundant if we know the Datalog:
In Pv1-1, S.B always equals R.B, V1.A equals R.A, and V1.C equals S.C
In Pv1-2, S.B’ always equals S.C, and V1.A and V1.C both equal S.B
Storing Provenance
Dropping the redundant columns leaves compact provenance tables:

Pv1-1  A  B  C
       1  2  3
       1  4  3

Pv1-2  B  C  C’
       2  3  2
       3  2  3
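The compact rows lose nothing, because the Datalog rules say how each stored value maps back to the source and view tuples. The sketch below expands a compact row into the tuples it relates; the function names and the dictionary output format are hypothetical, and the column roles are read off the two rules for V1.

```python
def expand_pv1_1(row):
    """Pv1-1(A, B, C) under V1(x,z) :- R(x,y), S(y,z): x = A, y = B, z = C."""
    a, b, c = row
    return {"R": (a, b), "S": (b, c), "V1": (a, c)}

def expand_pv1_2(row):
    """Pv1-2(B, C, C') under V1(x,x) :- S(x,y), S(y,x): x = B, y = C, C' = x."""
    b, c, c2 = row
    return {"S": (b, c), "S'": (c, c2), "V1": (b, b)}

# Compact rows as stored in the tables.
for row in [(1, 2, 3), (1, 4, 3)]:
    print(expand_pv1_1(row))
for row in [(2, 3, 2), (3, 2, 3)]:
    print(expand_pv1_2(row))
```

The first compact row (1, 2, 3) expands to R(1,2), S(2,3), and V1(1,3), matching the corresponding row of the uncompressed six-column table.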
Data Provenance Wrap-up
Provenance is critical to understanding and assessing the believability of data, and in debugging
Two equivalent representations – annotations vs graph
Provenance semiring model preserves the “expected” equivalences of the relational algebra
We can take semiring provenance and evaluate it with different semirings to get useful scores
We can store provenance using relations
Recent work beyond the scope of the book:
Extending provenance to more complex queries, e.g., with aggregation
Languages for querying provenance (primarily as a graph)