► Introduction
► Data Quality Problems
► Data Quality Dimensions
► Relevant activities in Data Quality
Creating Quality Data
[Figure: SOURCE DATA and TARGET DATA]
► Quality decisions must be based on good quality data (e.g., duplicate or missing
data may cause incorrect or even misleading results)
Research issues related to DQ
[Figure: Data Quality at the center of the related research areas, each annotated with its issues]
► Data Integration: Source Selection, Source Composition, Query Result Selection, Time Synchronization, Conflict Resolution, Record Matching, …
► Data Cleaning: Record Matching (deduplication), Data Transformation, …
► Statistical Data Analysis: Edit-imputation, Record Linkage, …
► Data Mining: Error Localization, DB profiling, Patterns in text strings, …
► Management Information Systems: Assessment, Process Improvement, Tradeoff/Cost Optimization, …
► Knowledge Representation: Conflict Resolution, …
Data Quality Application contexts
► Integrate data from different sources
► E.g., populating a data warehouse from different operational data stores
► Eliminate errors and duplicates within a single source
► E.g., duplicates in a file of customers
► Migrate data from a source schema into a different fixed target schema
► E.g., discontinued application packages
► Convert poorly structured data into structured data
► E.g., processing data collected from the Web
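As a rough illustration of the second context (eliminating duplicates within a single source), the sketch below removes duplicate entries from a small customer list. The `Customer` fields, the sample records, and the matching key (normalized name plus lower-cased e-mail) are assumptions made for this example only; real deduplication usually needs approximate record matching rather than an exact key.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    # Hypothetical record layout, chosen only for this example.
    name: str
    email: str


def matching_key(c: Customer) -> tuple:
    # Assumed matching rule: collapse whitespace and ignore case in the name,
    # trim and lower-case the e-mail address.
    name = " ".join(c.name.split()).casefold()
    email = c.email.strip().lower()
    return (name, email)


def deduplicate(customers):
    # Keep the first record seen for each matching key.
    seen = {}
    for c in customers:
        seen.setdefault(matching_key(c), c)
    return list(seen.values())


customers = [
    Customer("John Smith", "john@example.com"),
    Customer("john  smith ", "John@Example.com"),  # same customer, different formatting
    Customer("Mary Jones", "mary@example.com"),
]

unique = deduplicate(customers)
print(len(customers), "records in,", len(unique), "records out")
```

Keeping the first record per key is the simplest survivorship rule; choosing which duplicate to keep, or merging their values, is a separate conflict-resolution decision.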
Data Quality Dimensions ‘Recap’
► Accuracy
► Errors in data
Example: “Jhn” vs. “John”
► Currency
► Data that has not been kept up to date
Example: Residence (Permanent) Address: outdated vs. up to date
► Consistency
► Discrepancies in the data
Example: ZIP Code and City must be consistent with each other
► Completeness
► Lack of data
► Partial knowledge of the records in a table or of the attributes in a record
[Figure: example of completeness]
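Three of the dimensions above (accuracy, consistency, completeness) can be sketched as simple checks over a handful of records. Everything in the block is assumed for illustration: the sample records, the `KNOWN_NAMES` reference list, the `ZIP_TO_CITY` lookup table, and the use of `difflib` string similarity to suggest a correction. Currency is left out because it needs a timestamp for each value.

```python
from difflib import SequenceMatcher

# Hypothetical sample records; None marks a missing value.
records = [
    {"name": "Jhn", "zip": "20133", "city": "Milano"},
    {"name": "Mary", "zip": "00185", "city": "Milano"},
    {"name": "Anna", "zip": None, "city": "Roma"},
]

# Hypothetical reference data used as the "real-world" values.
KNOWN_NAMES = ["John", "Mary", "Anna"]
ZIP_TO_CITY = {"20133": "Milano", "00185": "Roma"}


def is_accurate(name):
    # Accuracy: the stored value matches a value in the reference list.
    return name in KNOWN_NAMES


def closest_name(name):
    # Suggest the reference value most similar to a possibly misspelled one.
    return max(KNOWN_NAMES, key=lambda k: SequenceMatcher(None, name, k).ratio())


def is_consistent(row):
    # Consistency: ZIP code and city must agree with the lookup table.
    if row["zip"] is None:
        return True  # a missing ZIP is a completeness problem, not an inconsistency
    return ZIP_TO_CITY.get(row["zip"]) == row["city"]


def completeness(rows, attr):
    # Completeness: fraction of rows with a non-missing value for the attribute.
    return sum(r[attr] is not None for r in rows) / len(rows)


for r in records:
    print(r["name"], is_accurate(r["name"]), closest_name(r["name"]), is_consistent(r))
print("zip completeness:", completeness(records, "zip"))
```

Here “Jhn” fails the accuracy check and is mapped to “John”, the second record fails the consistency check because its ZIP code belongs to a different city in the lookup table, and the ZIP attribute is complete for two of the three records.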
Tools for Data Cleaning
[Figure: Data Analysis, Metadata, Dictionaries, Human Knowledge, Schema Integration]
Data quality problems (1/3)
In a database environment:
► Schema-level data quality problems can be prevented with better
schema design, schema translation, and integration.