
These slides are at: http://www.macs.hw.ac.uk/~dwcorne/Teaching/

On Data Quality
Suppose you have a database sitting in front of you, and I ask "Is it a good quality database?"
What is your answer? What does quality depend on?
Note: this is about the data themselves, not the system used to access them.

• Source: Pipino et al. 2002



A Conventional Definition of Data Quality

Good quality data are:
Accurate, Complete, Unique, Up-to-date, and Consistent;
meaning …

A Conventional Definition of Data Quality, continued …

Accurate: This refers to how the data were recorded in the first place. What might be the inaccurately recorded datum in the following table?
Surname  Firstname  Age  Subject  Degree  Gender
Barratt  John       22   Maths    BSc     Male
Burns    Robert     24   CS       BSc     Male
Carter   Laura      20   Physics  MSc     Female
Davies   Michael    12   CS       BSc     Male
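A simple guard against this kind of inaccuracy is a plausibility check at the point of entry. Below is a minimal Python sketch (the field names and age bounds are my own illustrative choices, not from the slides) that flags rows like the one above whose recorded age is implausible for a university student table:

```python
# Plausibility check sketch: flag records whose age falls outside
# a range we consider believable for this table (bounds are illustrative).
def check_age(record, lo=16, hi=100):
    """Return True when the recorded age lies in [lo, hi]."""
    return lo <= record["age"] <= hi

students = [
    {"surname": "Barratt", "age": 22},
    {"surname": "Burns", "age": 24},
    {"surname": "Carter", "age": 20},
    {"surname": "Davies", "age": 12},  # the implausible entry
]
suspect = [r for r in students if not check_age(r)]
print([r["surname"] for r in suspect])  # -> ['Davies']
```

Checks like this cannot prove a value wrong, but they catch the obvious entry errors cheaply at gathering time.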

A Conventional Definition of Data Quality, continued …

Complete: This refers to whether or not the database really contains everything it is supposed to contain.
E.g. a patient’s medical records should contain
references to all medication prescribed to date
for that patient.
The BBC Licensing DB should contain an entry
for every address in the country. Does it?

A Conventional Definition of Data Quality, continued …

Unique: Every separate datum appears only once. How many 'Data Quality errors' can you find in the following table, and what types are they?
Surname Firstname DoB Driving test passed:
Smith J. 17/12/85 17/12/05
Smith Jack 17/12/85 17/12/2005
Smith Jock 17/12/95 17/12/2005
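Detecting near-duplicates like the first two rows usually needs fuzzy matching rather than exact comparison. A minimal sketch follows; the matching rule and the two-digit-year heuristic are illustrative assumptions, not a robust deduplication method:

```python
from difflib import SequenceMatcher

def normalise_date(d):
    """Expand two-digit years, assuming years <= 25 mean 20xx (a guess)."""
    day, month, year = d.split("/")
    if len(year) == 2:
        year = ("20" if int(year) <= 25 else "19") + year
    return f"{day}/{month}/{year}"

def likely_duplicates(a, b):
    """Treat rows as likely duplicates when DoBs match and first names agree."""
    if normalise_date(a["dob"]) != normalise_date(b["dob"]):
        return False
    sim = SequenceMatcher(None, a["first"], b["first"]).ratio()
    return sim > 0.5 or a["first"].rstrip(".") in b["first"]

rows = [
    {"surname": "Smith", "first": "J.", "dob": "17/12/85"},
    {"surname": "Smith", "first": "Jack", "dob": "17/12/85"},
    {"surname": "Smith", "first": "Jock", "dob": "17/12/95"},
]
print(likely_duplicates(rows[0], rows[1]))  # True: same DoB, 'J.' matches 'Jack'
print(likely_duplicates(rows[1], rows[2]))  # False: different DoB (85 vs 95)
```

Note how the date normalisation also resolves the 17/12/05 vs 17/12/2005 format clash in the table, which is itself a uniqueness hazard.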

A Conventional Definition of Data Quality, continued …

Up-to-date: The data are kept up to date.
The post office has just changed my postcode from EX6 8RA to EX6 8NU. Why does this make it difficult for me to get a sensible quote for home insurance or car insurance?
Can you think of a DB where it doesn't matter whether or not the data are kept up to date?

A Conventional Definition of Data Quality, continued …

Consistent: The data contain no logical errors or impossibilities; they make sense in and of themselves.
Why is the following mini DB inconsistent?

Date      Sales    Returns  Net income
23rd Nov  £25,609  £1,003   £24,506
24th Nov  £26,202  £1,601   £24,601
25th Nov  £28,936  £1,178   £25,758
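Consistency rules like "net income equals sales minus returns" can be checked mechanically. A short sketch over the table above:

```python
# Consistency check sketch: net income should equal sales minus returns.
rows = [
    {"date": "23rd Nov", "sales": 25609, "returns": 1003, "net": 24506},
    {"date": "24th Nov", "sales": 26202, "returns": 1601, "net": 24601},
    {"date": "25th Nov", "sales": 28936, "returns": 1178, "net": 25758},
]
inconsistent = [r["date"] for r in rows if r["sales"] - r["returns"] != r["net"]]
print(inconsistent)  # -> ['23rd Nov', '25th Nov']
```

Only 24th Nov balances; the other two rows violate the rule, which is the point of the exercise.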

The Several Problems with the Conventional (or any?) Definition

Fine, we can define Data Quality (DQ) in a way that makes it clear what kinds of issues we are thinking about. But is the definition any use beyond that? Can it be used:
• to measure DQ, so that we can say whether or not one DB is better than another, or so we can gauge improvements in DQ over time?
• on a wide range of different DBs, of widely different purposes, sizes, etc.?

No: Four problems

There are at least four specific problems with this definition if we are going to use it for any purpose other than just discussing DQ issues:
measurability, rigidity, context independence, and unhelpfulness.

DQ Def problems: measurability

It is not clear how we might measure quality according to some of these items.
Completeness: how will we know?
Uniqueness: it is hard to tell whether two entries are merely similar, or duplicates!
Up-to-date-ness: how do we know?
Consistency: consistency errors can be very hard to find, especially in a very large DB.
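Where a dimension can be measured at all, Pipino et al. (2002), cited in the references, suggest 'simple ratio' metrics: one minus the fraction of undesirable outcomes. A sketch (the error counts below are illustrative, not from the slides):

```python
# Simple-ratio DQ metric, after Pipino et al. (2002):
# score = 1 - (number of undesirable outcomes / total outcomes).
def simple_ratio(n_errors, n_total):
    return 1.0 - n_errors / n_total

# e.g. 37 rows with a missing or invalid postcode out of 1,000 rows:
completeness = simple_ratio(37, 1000)
print(round(completeness, 3))  # -> 0.963
```

The metric is only as good as the error count feeding it, which is exactly the measurability problem this slide raises.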

DQ Def problems: rigidity

The conventional definition is rigid and static. It is easy to think of two DBs that are both high quality according to this definition, but are nevertheless of entirely different quality. E.g. one may contain much more potentially relevant information; one may have better metadata; etc.

DQ Def problems: context independence

The conventional definition does not take into account the context of the DB. Different things may be important for quality in different circumstances.
E.g. a DB for public access must be highly accessible and interpretable; in a DB of experimental results, accuracy is more important than consistency (why?); etc.

DQ Def problems: unhelpfulness/vagueness

Partly because we need different DQ measures for different DBs, depending on context, the conventional definition alone is vague and unhelpful when it comes to providing clues for how to improve data quality.
How do we improve accuracy? It depends on how the data are recorded.
How do we improve consistency? It depends immensely on the kind of data we have, etc.

Towards modern definitions of DQ

An ideal definition of data quality:
– is backed up by a good understanding of how and why errors occur;
– relates to how the data are used;
– is helpful in leading to improvements in processes;
– is helpful in leading to usable measures/metrics.

The Data Quality Continuum

It's rare that a datum is entered once into a DB and then left alone. Usually, a datum has a long and varied life, in which errors can arise at each and every stage. The continuum is:
– Data gathering
– Data delivery
– Data storage
– Data integration
– Data retrieval
– Data analysis

So, if we want to monitor DQ, we need to monitor it at each of these stages.

The ISC DQ Continuum

Where might errors occur, of:
Accuracy?
Completeness?
Uniqueness?
Timeliness?
Consistency?

What else is important in this case?


Where DQ problems occur (gathering)

• Manual data entry (how can we improve this?)
• Lack of uniform standards for format and content.
• Duplicates arising from parallel entry.
• Approximations, alternatives, entries altered in order to cope with s/w and/or h/w constraints.
• Measurement errors.

Where DQ problems occur (delivery)

• Multiple hops from source to DB – problems can happen anywhere.
• Inappropriate pre-processing (e.g. removing some 'small' seismograph readings before sending on to ISC; rounding up or down when the destination needs more accurate data).
• Transmission problems: buffer overflows, checks (did all files arrive, and all correctly?).
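The arrival checks in the last bullet are often implemented with checksums: the sender publishes a digest of each file and the receiver recomputes it. A minimal sketch using Python's standard hashlib (the payload is invented):

```python
import hashlib

# Transmission check sketch: compare a digest computed at the source
# with one recomputed on arrival, to detect a corrupted transfer.
def sha256_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

sent = b"station,reading\nEDI,4.2\n"      # bytes the sender shipped
received = b"station,reading\nEDI,4.2\n"  # bytes that actually arrived
ok = sha256_of(sent) == sha256_of(received)
print(ok)  # True when the payload survived intact
```

A mismatch tells you that something changed in transit, though not where; file counts and per-file digests narrow it down.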

Where DQ problems occur (storage)

• Poor, out-of-date or inappropriate metadata.
• Missing timestamps.
• Conversion to storage format (e.g. to Excel files, to higher/lower precision).

Where DQ problems occur (integration)

This is the business of combining datasets – e.g. from different parts of a company, from (previously) different companies following an acquisition, from different government agencies, etc.
• Different keys, different fields, different formats.
• Different definitions ('customer', 'income', …).
• Sociological factors: reluctance to share!
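A common first step in integration is mapping each source's keys and formats onto one agreed schema before merging. A sketch with two hypothetical source systems (all field names here are invented for illustration):

```python
# Integration sketch: two sources store the same customer with different
# field names and formats; normalise both onto one agreed schema.
def from_system_a(row):
    return {"customer_id": row["CustID"].strip(),
            "name": row["Name"].title()}

def from_system_b(row):
    return {"customer_id": row["customer_ref"].upper(),
            "name": row["full_name"].title()}

a = from_system_a({"CustID": " C1042 ", "Name": "laura carter"})
b = from_system_b({"customer_ref": "c1042", "full_name": "LAURA CARTER"})
print(a == b)  # -> True: both normalise to the same record
```

The hard part in practice is not the code but agreeing the target schema and definitions across organisations, which is where the sociological factors bite.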

Where DQ problems occur (retrieval/analysis)

The problem here is usually the quality of DBs that store the retrieved data, or the use of the retrieved data in general. Problems arise because:
• The source DB is not properly understood!
• Straightforward mistakes occur in the queries that retrieve the relevant data.
E.g. a database of genes contains entries that indicate whether or not each gene has a known or suspected link with cancer. A retrieval/analysis task leads to publishing a list of genes that are not relevant to cancer. What is the problem here?

References
• A PowerPoint presentation called 'Data Quality and Data Cleaning: An Overview' by Tamraparni Dasu and Theodore Johnson, AT&T Labs.
• A paper called 'Data Cleaning: Problems and Current Approaches' by Erhard Rahm and Hong Hai Do, University of Leipzig, Germany.
• Pipino, L. L., Lee, Y. W., & Wang, R. Y. (2002). Data quality assessment. Communications of the ACM, 45(4), 211-218.
