
Data Cleaning

• Elaboration of the Data & Data Quality slides
• Pertains to stored data quality
Slide Overview

► Introduction
► Data Quality Problems
► Data Quality Dimensions
► Relevant activities in Data Quality
Creating Quality Data

[Diagram] SOURCE DATA → Data Extraction → Data Transformation → Data Loading → TARGET DATA

ETL: Extraction, Transformation and Loading


Importance of Data Cleaning and Transformation
Data in the real world is dirty, i.e.,
► incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
► noisy: containing errors or outliers (spelling, phonetic and typing errors, word
transpositions, multiple values in a single free-form field)
► inconsistent: containing discrepancies in codes or names (synonyms and
nicknames, prefix and suffix variations, abbreviations, truncation and initials)
► e.g., Was rating “1,2,3”, now rating “A, B, C”
► e.g., discrepancy between duplicate records
Why Is Data Dirty?
► Incomplete data comes from:
► data values not available when collected
► different criteria between the time the data was collected and the time
it is analyzed
► human/hardware/software problems
► Noisy data comes from:
► data collection: faulty instruments
► data entry: human or computer errors
► data transmission
► Inconsistent (and redundant) data comes from:
► different data sources, hence non-uniform naming conventions/data codes
► functional dependency and/or referential integrity violations
Why is Data Quality Important?

Activity of converting source data into target data without errors, duplicates,
and inconsistencies, i.e., cleaning and transforming to get high-quality data!

► Low-quality data leads to low-quality decisions!
► Quality decisions must be based on good-quality data (e.g., duplicate or missing
data may cause incorrect or even misleading results)
Research issues related to DQ
[Diagram: research areas related to data quality and typical issues studied in each]
► Data Integration: source selection, source composition, query result selection, time synchronization, …
► Data Cleaning: conflict resolution, record matching (deduplication), data transformation, …
► Statistical Data Analysis: edit-imputation, record linkage, …
► Data Mining: error localization, DB profiling, patterns in text strings, …
► Data Quality Management: assessment, process improvement, cost/optimization tradeoffs, …
► Knowledge Representation: conflict resolution, …
► Information Systems
Data Quality Application contexts
► Integrate data from different sources
► E.g., populating a data warehouse from different operational data stores
► Eliminate errors and duplicates within a single source
► E.g., duplicates in a file of customers
► Migrate data from a source schema into a different fixed target schema
► E.g., discontinued application packages
► Convert poorly structured data into structured data
► E.g., processing data collected from the Web

Data Quality Dimensions ‘Recap’
► Accuracy
► Errors in data
Example: “Jhn” vs. “John”
► Currency
► Lack of updated data
Example: residence (permanent) address: outdated vs. up-to-date
► Consistency
► Discrepancies in the data
Example: ZIP code and city must be consistent
► Completeness
► Lack of data
► Partial knowledge of the records in a table or of the attributes in a record
Tools for Data Cleaning

► Ad-hoc programs written in a programming language like C or Java,
or using an RDBMS proprietary language
► But programs are difficult to optimize and maintain
► RDBMS mechanisms for guaranteeing integrity constraints
► But these do not address important data instance problems
► Data transformation scripts using an ETL
(Extraction-Transformation-Loading) or data quality tool
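
To make the ad-hoc-program option concrete, here is a minimal Python sketch of a cleaning transformation applied while moving records from source to target (file and field names are illustrative, not from the slides):

```python
import csv

def clean_row(row):
    """Apply simple cleaning rules to one source record (illustrative rules only)."""
    row["name"] = row["name"].strip().title()      # normalize whitespace and case
    row["price"] = row["price"].replace(",", ".")  # unify the decimal separator
    return row

# Extract from the source, transform each record, load into the target.
with open("source.csv", newline="") as src, open("target.csv", "w", newline="") as tgt:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(tgt, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        writer.writerow(clean_row(row))
```

As the slide notes, such programs work but are hard to optimize and maintain once the number of rules grows, which is what motivates dedicated ETL and data quality tools.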
Typical architecture of a Data Quality system
[Diagram] SOURCE DATA → Data Extraction → Data Transformation → Data Loading → TARGET DATA,
supported by Data Analysis, Metadata/Dictionaries, Schema Integration, and Human Knowledge
feeding the extraction and transformation steps
Data quality problems (1/3)
In a database environment:
► Schema-level data quality problems: can be prevented with better
schema design, schema translation and integration
► Instance-level data quality problems: errors and inconsistencies of
data that are not prevented at schema level
Data quality problems (2/3)
► Schema level data quality problems
► Avoided by an RDBMS
► Missing data – product price not filled in
► Wrong data type – “abc” in product price
► Wrong data value – 0.5 in product tax (IVA)
► Dangling data – category identifier of product does not exist
► Exact duplicate data – different persons with same SSN
► Generic domain constraints – incorrect invoice price
► Not avoided by an RDBMS
► Outdated temporal data – just-in-time requirement
► Inconsistent spatial data – coordinates and shapes
► Name conflicts – person vs person or person vs client
► Structural Conflicts - addresses
Data quality problems (3/3)
► Instance level data quality problems
► Single record
► Missing data in a not null field – id:-9999999
► Erroneous data – price:5 but real price:50
► Misspellings: Mary Dube versus Mary Duve
► Embedded values: Dr. Mary Dube
► Misfielded values: city: Gwanda, instead of Bulawayo
► Ambiguous data
► Multiple records
► Duplicate records: Name:Mary Dube, Birth:01/01/1950 and Name:Mary Dube,
Birth:01/01/1950
► Contradicting records: Mary Dube, Birth:01/01/1950 and Mary Dube,
Birth:01/01/1956
► Non-standardized data: Mary Dube, M. Dube
Data Quality Dimensions
Traditional data quality dimensions
► Accuracy
► Completeness
► Time-related dimensions: Currency, Timeliness, and
Volatility
► Consistency

► Their definitions do not provide quantitative measures, so
one or more metrics have to be associated
► For each metric, one or more measurement methods have to be
provided regarding: (i) where the measurement is taken; (ii) what
data are included; (iii) the measurement device; and (iv) the
scale on which results are reported.
► Schema quality dimensions are also defined
Accuracy
► Closeness between a value v and a value v’, considered as the
correct representation of the real-world phenomenon that v aims to
represent.
► Ex: for a person name “John”, v’=John is correct, v=Jhn is incorrect
► Syntactic accuracy: closeness of a value v to the elements of the
corresponding definition domain D
► Ex: if v=Jack, even if v’=John, v is considered syntactically correct
► Measured by means of comparison functions (e.g., edit distance) that return a
score
► Semantic accuracy: closeness of the value v to the true value v’
► Measured with a <yes, no> or <correct, not correct> domain
► Coincides with correctness
► The corresponding true value has to be known
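
As an illustration of a comparison function for syntactic accuracy, here is a minimal Python sketch of edit (Levenshtein) distance with a normalized score; the domain and the scoring scheme are assumptions for the example, not from the slides:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def syntactic_accuracy(v: str, domain) -> float:
    """Score in [0, 1]: closeness of v to the nearest element of the domain D."""
    best = min(edit_distance(v, d) for d in domain)
    return 1 - best / max(len(v), 1)

# "Jhn" is at edit distance 1 from the domain element "John",
# so it scores 1 - 1/3 ≈ 0.67 against this domain.
print(syntactic_accuracy("Jhn", ["John", "Jack", "Mary"]))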
Granularity of accuracy definition
► Accuracy may refer to:
► a single value of a relation attribute
► an attribute or column
► a relation
► the whole database
Completeness
► “The extent to which data are of sufficient breadth, depth, and scope for
the task in hand.”
► Three types:
► Schema completeness: degree to which concepts and their properties
are not missing from the schema
► Column completeness: evaluates the missing values for a specific
property or column in a table.
► Population completeness: evaluates missing values with respect to a
reference population
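
A minimal sketch of how column completeness could be measured as the fraction of non-missing values (field names and data are illustrative):

```python
def column_completeness(rows, column):
    """Fraction of records whose value for `column` is present (not None or empty)."""
    present = sum(1 for r in rows if r.get(column) not in (None, ""))
    return present / len(rows) if rows else 1.0

people = [{"name": "Mary", "email": "mary@example.org"},
          {"name": "John", "email": None},
          {"name": "Anna", "email": ""}]
print(column_completeness(people, "email"))  # 0.33...
```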
Completeness of relational data
► The completeness of a table characterizes the extent to which the
table represents the real world.
► Can be characterized wrt:
► The presence/absence and meaning of null values
Example: Person(name, surname, birthdate, email); if email is null, it may
indicate that the person has no email (no incompleteness), that the email exists
but is not known (incompleteness), or that it is not known whether the person
has an email (incompleteness may not be the case)
► Validity of open world assumption (OWA) or closed world assumption
(CWA)
► OWA: one can state neither the truth nor the falsity of facts not represented in the
tuples of a relation
► CWA: only the values actually present in a relational table and no other values
represent facts of the real world.
Time-related dimensions
► Currency: concerns how promptly data are updated
Example: if the residential address of a person is updated (it corresponds to
the address where the person lives) then the currency is high
► Volatility: characterizes the frequency with which data vary in time
Example: Birth dates (volatility zero) vs stock quotes (high degree of volatility)
► Timeliness: expresses how current data are for the task in hand
Example: The timetable for university courses can be current by containing the
most recent data, but it cannot be timely if it is available only after the
start of the classes.
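
A minimal sketch of one possible currency metric, the age of a value since its last update; both the metric and the field names are assumptions for illustration:

```python
from datetime import date
from typing import Optional

def currency_age_days(last_update: date, when: Optional[date] = None) -> int:
    """Age of a stored value in days; under this metric, lower means more current."""
    when = when or date.today()
    return (when - last_update).days

# An address last updated on 1 Jan, measured on 1 Mar of the same year:
print(currency_age_days(date(2024, 1, 1), date(2024, 3, 1)))  # 60
```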
Consistency

► Captures the violation of semantic rules defined over a set of data


items, where data items can be tuples of relational tables or records in
a file
► Integrity constraints in relational data
► Domain constraints, Key, inclusion and functional dependencies
► Data edits: semantic rules in statistics
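
As an illustration, a minimal Python sketch that checks one such semantic rule, the functional dependency ZIP → City (field names and data are illustrative):

```python
from collections import defaultdict

def fd_violations(rows, lhs="zip", rhs="city"):
    """Groups of values violating the functional dependency lhs -> rhs
    (here: the same ZIP code mapped to different cities)."""
    groups = defaultdict(set)
    for r in rows:
        groups[r[lhs]].add(r[rhs])
    return {k: v for k, v in groups.items() if len(v) > 1}

records = [{"zip": "00100", "city": "Harare"},
           {"zip": "00100", "city": "Bulawayo"},  # inconsistent with the row above
           {"zip": "00200", "city": "Gweru"}]
print(fd_violations(records))  # {'00100': {'Harare', 'Bulawayo'}}
```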
Evolution of dimensions
► Traditional dimensions are Accuracy, Completeness, Timeliness,
Consistency
1. With the advent of networks, sources increase dramatically,
and data often become “found data”.
2. Federated data, where many disparate data are integrated, are
highly valued
3. Data collection and analysis are frequently disconnected.
► As a consequence we have to revisit the concept of data quality
and new dimensions become fundamental.
Other dimensions of data quality
► Interpretability: concerns the documentation and metadata that are
available to correctly interpret the meaning and properties of data
sources
► Synchronization between different time series: concerns proper
integration of data having different time stamps.
► Accessibility: measures the ability of the user to access the data from
his/her own culture, physical status/functions, and technologies
available.
Relevant activities in Data Quality
Improving Data Quality
► Standardization/normalization
► Record Linkage/Object identification/Entity identification/Record matching
► Data integration
► Schema matching
► Instance conflict resolution
► Source selection
► Result merging
► Quality composition
► Error localization/Data Auditing
► Data editing-imputation/Deviation detection
► Data profiling
► Structure induction
► Data correction/data cleaning/data scrubbing
► Schema cleaning
Standardization/normalization
► Modification of data with new data according to defined standards or
reference formats
Example:
► Change “Snr” to “Senior”
► Change “Channel Str.” to “Channel Street”
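
A minimal Python sketch of rule-based standardization along the lines of the examples above (the lookup table is illustrative):

```python
import re

# Reference formats: abbreviation -> standard form (illustrative entries).
STANDARD_FORMS = {
    "Snr": "Senior",
    "Str.": "Street",
    "Ave.": "Avenue",
}

def standardize(text):
    """Replace known abbreviations with their standard forms."""
    for abbr, full in STANDARD_FORMS.items():
        text = re.sub(rf"\b{re.escape(abbr)}(?=\s|$)", full, text)
    return text

print(standardize("Channel Str."))   # Channel Street
print(standardize("John Dube Snr"))  # John Dube Senior
```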
Record Linkage/Object identification/ Entity
identification/Record matching/Duplicate detection

► Activity required to identify whether data in the same
source or in different ones represent the same object
of the real world
► Given two tables or two sets of tables, representing
two entities/objects of the real world, find and cluster
all records in tables referring to the same
entity/object instance
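
A minimal pairwise-matching sketch of this activity, using a string-similarity threshold on names; the threshold, fields and comparator are assumptions, and production record linkage adds blocking and richer comparison functions:

```python
from difflib import SequenceMatcher

def similar(a, b, threshold=0.85):
    """Crude field comparator: normalized string similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def match_records(records):
    """Candidate pairs of records likely referring to the same real-world entity."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if (similar(records[i]["name"], records[j]["name"])
                    and records[i]["birth"] == records[j]["birth"]):
                pairs.append((i, j))
    return pairs

people = [{"name": "Mary Dube", "birth": "01/01/1950"},
          {"name": "Mary Duve", "birth": "01/01/1950"},  # probable duplicate
          {"name": "John Moyo", "birth": "03/05/1971"}]
print(match_records(people))  # [(0, 1)]
```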
Data integration
► Task of presenting a unified view of data owned by
heterogeneous and distributed data sources
► Two sub-activities:
► Quality-driven query processing: task of providing query
results on the basis of a quality characterization of data at
sources
► Instance-level conflict resolution: task of identifying and
solving conflicts of values referring to the same real-world
objects.
Instance-level conflict resolution
► Instance level conflicts can be of three types:
► representation conflicts, e.g. USD vs. ZWL
► key equivalence conflicts, i.e. same real world objects with
different identifiers
► attribute value conflicts, i.e. instances corresponding to the same real-world
objects and sharing an equivalent key differ on other
attributes
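
As an illustration of resolving attribute value conflicts, here is a minimal sketch of one common policy, preferring the value from the most recently updated source; both the policy and the field names are assumptions for the example:

```python
def resolve(conflicting, key="balance", ts="last_update"):
    """Pick the value of `key` from the record with the newest timestamp."""
    return max(conflicting, key=lambda r: r[ts])[key]

# Two sources disagree on the same customer's balance:
versions = [{"balance": 100, "last_update": "2024-01-10"},
            {"balance": 120, "last_update": "2024-03-02"}]
print(resolve(versions))  # 120 (the most recently updated source wins)
```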
Error localization/Data Auditing
► Given one/two/n tables or groups of tables, and
a group of integrity constraints/qualities (e.g.
completeness, accuracy), find records that do not
respect the constraints/qualities.
► Data editing-imputation
► Focus on integrity constraints
► Deviation detection
► data checking that marks deviations as possible data
errors
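
A minimal sketch of error localization against a set of edits, i.e. constraints that clean records must satisfy (the edits shown are illustrative):

```python
# Edits: (description, predicate that a clean record must satisfy).
EDITS = [
    ("price must be positive",       lambda r: r["price"] > 0),
    ("tax rate must be in [0, 0.3]", lambda r: 0 <= r["tax"] <= 0.3),
]

def localize_errors(rows):
    """Return (record index, violated edit) pairs."""
    return [(i, desc)
            for i, r in enumerate(rows)
            for desc, ok in EDITS
            if not ok(r)]

products = [{"price": 50, "tax": 0.15},
            {"price": -5, "tax": 0.5}]  # violates both edits
print(localize_errors(products))
```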
Data Profiling

► Evaluating statistical properties and
intensional properties of tables and
records
► Structure induction of a structural
description, i.e. “any form of regularity
that can be found”
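
A minimal profiling sketch computing simple per-column statistics from which such regularities can be spotted (columns and data are illustrative):

```python
from collections import Counter

def profile(rows, column):
    """Basic column profile: null rate, distinct count, most common values."""
    values = [r.get(column) for r in rows]
    non_null = [v for v in values if v not in (None, "")]
    return {
        "null_rate": 1 - len(non_null) / len(values) if values else 0.0,
        "distinct": len(set(non_null)),
        "top": Counter(non_null).most_common(3),
    }

rows = [{"city": "Harare"}, {"city": "Harare"}, {"city": ""}, {"city": "Gweru"}]
print(profile(rows, "city"))  # {'null_rate': 0.25, 'distinct': 2, 'top': [('Harare', 2), ('Gweru', 1)]}
```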
Data correction/data cleaning/data scrubbing

► Given one/two/n tables or groups of tables, and a
set of identified errors in records with respect to
given qualities, generate probable corrections
and correct the records, in such a way that the new
records respect the qualities
Schema cleaning
► A schema is an abstract representation of the data
► Transform the conceptual schema in order to
achieve or optimize a given set of qualities (e.g.
Readability, Normalization), while preserving
other properties (e.g. equivalence of content)
References
► C. Batini and M. Scannapieco, “Data Quality: Concepts, Methodologies and Techniques”,
Springer-Verlag, 2006 (Chaps. 1, 2, and 4).
► J. Barateiro and H. Galhardas, “A Survey of Data Quality Tools”, Datenbank-Spektrum 14:
15–21, 2005.
