Data Warehousing
Lecture-22
DQM: Quantifying Data Quality
Ahsan Abdullah
Assoc. Prof. & Head
Center for Agro-Informatics Research
www.nu.edu.pk/cairindex.asp
National University of Computers & Emerging Sciences, Islamabad
Email: ahsan101@yahoo.com
DWH-Ahsan Abdullah
Background
Companies want to measure the quality of their data, and that requires
usable metrics.
Ratios
Min-Max
Data Quality Assessment Techniques
Simple Ratios
Free-of-Error
Completeness
Schema
Column
Population
Consistency
Ratio of violations to total number of consistency
checks.
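The simple ratios above can be sketched in a few lines of Python. This is a minimal sketch, not the lecture's own implementation: the function names, the sample rows, and the convention of reporting 1 minus the ratio of undesirable outcomes (so that 1.0 means best quality) are assumptions added for illustration.

```python
def simple_ratio(undesirable: int, total: int) -> float:
    """Return 1 - undesirable/total, so 1.0 is best and 0.0 is worst.

    Assumption: the score is reported as 1 minus the raw ratio.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return 1.0 - undesirable / total


def column_completeness(rows, column):
    """Column completeness: share of rows where `column` is not missing."""
    missing = sum(1 for row in rows if row.get(column) is None)
    return simple_ratio(missing, len(rows))


# Hypothetical sample data; None marks a missing value.
rows = [
    {"id": 1, "gender": "M", "age": 34},
    {"id": 2, "gender": None, "age": 29},
    {"id": 3, "gender": "F", "age": None},
    {"id": 4, "gender": "F", "age": 41},
]

gender_completeness = column_completeness(rows, "gender")

# Consistency: 3 violations found in 20 consistency checks (made-up numbers).
consistency = simple_ratio(undesirable=3, total=20)

# Free-of-error: 1 row known to be wrong out of 4 (made-up number).
free_of_error = simple_ratio(undesirable=1, total=len(rows))
```

The same helper serves all three metrics because each one only differs in what counts as an undesirable outcome.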
Data Quality Assessment Techniques
Min-Max
Used for multiple values, based on aggregation of normalized individual values
Min is conservative, while max is liberal
Believability
Comparison with a standard or experience
Min {0.8, 0.7, 0.6} = 0.6
Weighted average
Accessibility
Max {0, 1 - Trd/Tru}
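The min-max operations on this slide can be illustrated as follows. This is a hedged sketch: the function names are hypothetical, and `t_rd` and `t_ru` simply stand in for the slide's Trd and Tru, which are assumed here to be two positive time intervals measured in the same unit.

```python
def believability(normalized_ratings):
    """Conservative aggregate: the weakest individual rating wins."""
    return min(normalized_ratings)


def weighted_average(values, weights):
    """Alternative aggregate; weights are assumed to sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(v * w for v, w in zip(values, weights))


def accessibility(t_rd, t_ru):
    """Max {0, 1 - Trd/Tru}: clamps the score at 0 when Trd exceeds Tru."""
    if t_ru <= 0:
        raise ValueError("t_ru must be positive")
    return max(0.0, 1.0 - t_rd / t_ru)


# The slide's own example: Min {0.8, 0.7, 0.6} = 0.6
score = believability([0.8, 0.7, 0.6])

# Made-up intervals: the ratio is 0.5, so the score is 0.5.
on_time = accessibility(t_rd=5, t_ru=10)

# Made-up intervals: the ratio exceeds 1, so the score is clamped to 0.
too_late = accessibility(t_rd=15, t_ru=10)
```

Taking the minimum means one poor rating drags the whole score down, which is why the slide calls min conservative; the weighted average is the middle ground when some ratings matter more than others.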
Data Quality Validation Techniques
Referential Integrity Validation
RI is checked every week or month, and the number of orphan
records should go down over time.
RI peculiar to DWH, not for operational systems
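A periodic orphan-record check of the kind described above might look like this. It is a minimal sketch under stated assumptions: the table contents, the `customer_id` foreign key, and the function name are all hypothetical, and a real warehouse would usually run the equivalent query in SQL.

```python
def find_orphans(fact_rows, dimension_keys, foreign_key):
    """Return fact rows whose foreign key has no matching dimension key."""
    return [row for row in fact_rows if row[foreign_key] not in dimension_keys]


# Hypothetical dimension table keys and fact table rows.
customer_keys = {1, 2, 3}
sales = [
    {"sale_id": 100, "customer_id": 1},
    {"sale_id": 101, "customer_id": 2},
    {"sale_id": 102, "customer_id": 4},   # no such customer: orphan
    {"sale_id": 103, "customer_id": 3},
    {"sale_id": 104, "customer_id": 9},   # no such customer: orphan
]

orphans = find_orphans(sales, customer_keys, "customer_id")
orphan_count = len(orphans)
```

Recording `orphan_count` at each weekly or monthly run gives the trend the slide asks for: the count should fall from one run to the next.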
Business Case for RI
Performance Case for RI
The cost of enforcing RI is very high for large-volume DWH
implementations, therefore:
3 steps of Attribute Domain Validation
Step-1: Capture and quantify the occurrences of
each domain value within each coded attribute of
the database.
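Step-1 amounts to a frequency count per coded attribute. The sketch below uses `collections.Counter` from the Python standard library; the sample rows, attribute names, and function name are hypothetical.

```python
from collections import Counter


def domain_value_counts(rows, coded_attributes):
    """Count how often each domain value occurs in each coded attribute."""
    return {
        attribute: Counter(row.get(attribute) for row in rows)
        for attribute in coded_attributes
    }


# Hypothetical customer rows with two coded attributes.
rows = [
    {"gender": "M", "status": "A"},
    {"gender": "F", "status": "A"},
    {"gender": "F", "status": "X"},
    {"gender": "m", "status": "A"},   # lowercase code, likely a data-entry variant
    {"gender": None, "status": "A"},  # missing code
]

counts = domain_value_counts(rows, ["gender", "status"])

for attribute, counter in counts.items():
    # most_common() lists values from most to least frequent.
    print(attribute, counter.most_common())
```

Rare or unexpected values in the output, such as the lowercase `m` or the missing gender here, are the candidates to examine in the later steps.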
Attribute Domain Validation: What next?
What to do next?
Trace back to source cause(s).
Data Quality Rules
Statistical Validation using Histogram
Figure: age histogram with a spike of centenarians (age >= 100 yrs) as outliers
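A histogram check for an age column could be sketched as follows. The ages, the 10-year bin width, and the function name are assumptions made for illustration; the 100-year threshold comes from the slide.

```python
from collections import Counter


def age_histogram(ages, bin_width=10):
    """Map each bin's lower bound to the number of ages falling in that bin."""
    return Counter((age // bin_width) * bin_width for age in ages)


# Hypothetical ages; the repeated 100s mimic a default or placeholder value.
ages = [23, 25, 31, 34, 38, 42, 45, 47, 51, 58, 63, 100, 100, 100, 100, 100]

histogram = age_histogram(ages)

# Bins starting at 100 or above hold the ages the slide treats as outliers.
suspect_bins = {start: n for start, n in histogram.items() if start >= 100}

# Print a crude text histogram, one '#' per record.
for start in sorted(histogram):
    print(f"{start:>3}-{start + 9:<3} {'#' * histogram[start]}")
```

A spike in the 100+ bin that is taller than its neighbouring bins suggests a systematic cause, such as a default value entered in place of a real age, rather than a genuine group of very old customers.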