You are on page 1of 15

Virtual University of Pakistan

Data Warehousing
Lecture-22
DQM: Quantifying Data Quality

Ahsan Abdullah
Assoc. Prof. & Head
Center for Agro-Informatics Research
www.nu.edu.pk/cairindex.asp
National University of Computers & Emerging Sciences, Islamabad
Email: ahsan101@yahoo.com
1
DWH-Ahsan Abdullah
Background
Companies want to measure the quality of their data that requires
usable metrics.

Have to deal with both the subjective perceptions and objective


measurements.
Text will not go to graphics
Subjective data quality assessments reflect the needs and
experiences of stakeholders.

Objective assessments can be task-independent or task-dependent.

Task-independent metrics reflect states of the data without the


contextual knowledge of the application.

Task dependent metrics, include organization’s business rules,


regulations etc.

We will discuss objective assessment and validation techniques


(dependent & independent), if time permits will briefly cover
subjective assessment too. 2
More on Characteristics of Data Quality
Data Quality Dim Definition Only this column will go to graphics

Believability The extent to which data is regarded as true and


credible.
Appropriate The extent to which the volume of data is appropriate
Amount of for the task at hand.
Data
Timeliness A measure of how current or up to date the data is.
Accessibility The extent to which data is available, or easily and
quickly retrievable
Objectivity The extent to which data is unbiased, unprejudiced,
and impartial.
Interpretability The extent to which data is in appropriate languages,
symbols, and units, and the definitions are clear.
Uniqueness The state of being only one of its kind or being
without an equal or parallel. 3
Data Quality Assessment Techniques

 Ratios

 Min-Max

4
Data Quality Assessment Techniques
 Simple Ratios
 Free-of-Error
 Completeness
 Schema Sub-Sub-bullets will not go to graphics
 Column
 Population
 Consistency
Ratio of violations to total number of consistency
checks.
5
Data Quality Assessment Techniques
 Min-Max
 Sub-bullets and keys will not go to graphics
Used for multiple values, based on aggregation of normalized individual values
 Min is conservative, while max is liberal

 Believability
 Comparison with a standard or experience
 Min {0.8, 0.7, 0.6) = 0.6
 Weighted average

 Appropriate Amount of Data


Min {Dp/Dn , Dn/Dp}

Dp: Data units provided


6
Dn: Data units needed
Data Quality Assessment Techniques
 Min-Max
C: Currency
 Timeliness V: Volatility
A: Age
Max {0, 1- C/V} C = A + Dt - It Dt: Delivery time
It: Input time (received in system)

 Accessibility
Max {0, 1- Trd/Tru}

Sub-bullets and keys will not go to graphics


Trd: Time between request
by user to delivery

Tru: Request by user to time


data remains useful

7
Data Quality Validation Techniques

 Referential Integrity (RI).


 Attribute domain.
 Using Data Quality Rules.
 Data Histograming.

8
Referential Integrity Validation
RI checked every week or month, and no. of orphan
records should be going down with time.
Yellow will not go to graphics
RI peculiar to DWH, not for operational systems

Example: How many outstanding payments in the


DWH without a corresponding customer_ID in the
customer table?

9
Business Case for RI

Not very interesting to know


number of outstanding payments
from a business point of view.

Interesting to know the actual


amount outstanding, on per year
basis, per region basis…

10
Performance Case for RI
Cost of enforcing RI is very high for large volume DWH
implementations, therefore:

 Should RI constraints be turned OFF in a data warehouse? or

 Should those records be “discarded” that violate one or more


RI constraints?

11
3 steps of Attribute Domain Validation
Step-1: Capture and quantify the occurrences of
each domain value within each coded attribute of
the database.
Yellow will go to graphics

Step-2: Compare actual content of attributes


against set of valid values.

Step-3: Investigate exceptions to determine


cause and impact of the data quality defects.

Note: Step 3 (above) applies to all defect types.

12
Attribute Domain Validation: What next?

What to do next?
 Trace back to source cause(s).

 Quantify business impact of the defects.

 Assess cost (and time frame) to fix and proceed


accordingly.

13
Data Quality Rules

14
Statistical Validation using Histogram
Spike of
Centurions (age >= 100 yrs)

outliers

1901 …………………………………………. 2000

NOTE: For a certain environment, the above distribution may


be perfectly normal.
15

You might also like