Data Management & Warehousing

Data Quality: Common Problems & Checks
Date: 24 April 2009 Location: Zagreb, Croatia
David M. Walker davidw@datamgmt.com +44 (0) 7050 028 911 - http://www.datamgmt.com

Agenda
•  •  •  •  •  Introduction Common Problems Automated Checking Profiling Checks Conclusions

24 April 2009

© 2009 Data Management & Warehousing

Page 2

Introduction
•  Data Quality problems are a SOURCE SYSTEM issue and an ETL issue
–  They just manifest themselves in the data warehouse

•  Prevention is better than cure
–  Fixing the source system or the ETL is ALWAYS cheaper and more effective than cleaning the data in the ETL or in the Data Warehouse itself

•  Data Quality is a continuous process
–  It is never finished and always needs to be monitored
24 April 2009 © 2009 Data Management & Warehousing Page 3

The Impact of Poor Data Quality
•  Devalues the data warehouse
–  Discourages people from trusting or using the system and therefore curtailing the life of the data warehouse

•  Highlights failings in the source system and/or the business process
–  Businesses would rather fix at any cost in the data warehouse and pretend that there isn’t a source system problem
24 April 2009 © 2009 Data Management & Warehousing Page 4

Common Problems
•  11 types of problem that account for the most common problems •  They usually reflect poor design and/or implementation of systems •  Most can be fixed or monitored and managed to limit the impact

24 April 2009

© 2009 Data Management & Warehousing

Page 5

Referential Issues
•  Keys that are not unique
–  Systems that do not enforce the unique primary key or flat files or spreadsheets –  Also generated by ETL that creates a surrogate key incorrectly

•  Referential Integrity Failures
–  Where referentially integrity is not enforced values in the child table are created that are not in the parent table
24 April 2009 © 2009 Data Management & Warehousing Page 6

Data Type Issues
•  Format Errors
–  Typically in Date/Time type fields –  02/04/2009 2nd April (UK) or 4th Feb (US)

•  Inappropriate Data Types
–  Storing Dates in Character Strings 20090624 as YYYYMMDD format string –  But what about 20090230?

24 April 2009

© 2009 Data Management & Warehousing

Page 7

Data Model Issues
•  De-normalised tables
–  Commonly created for performance reasons –  Inherently duplicates data –  Often gets out of sync

•  Data/Column Retirement
–  Upgrade to system retires a column –  ETL continues to use the old column

•  Poor Table/Column Naming
–  Don’t assume that a column does what it says –  Don’t assume that a column is still being used for it’s original purpose
24 April 2009 © 2009 Data Management & Warehousing Page 8

Data Content Issues
•  Null Values
–  Systems that have many optional fields will often have missing values –  Null values allow rows to be silently omitted from queries

•  Inappropriate Values
–  Databases allow special characters and/or leading/trailing white space –  “DataspaceQuality” != “DatatabQuality”
24 April 2009 © 2009 Data Management & Warehousing Page 9

Data Feed Issues
•  Missing Data
–  Where a stream of files are loaded by ETL if one is dropped it can go un-noticed –  Common with CDR type loads in Telcos

•  Late Data
–  A short term data quality issue –  Leaves users believing there is a problem –  Produces inconsistent reporting over time
24 April 2009 © 2009 Data Management & Warehousing Page 10

Automated Checking
•  Regularly run checks •  Broad Coverage across systems
–  100s and 1000s not 10s of queries –  Run in a low priority loop in the background

•  Used against:
–  Sources –  Data Staging –  Data Warehouse

•  No Product Required
–  We often implement this as a controlling shell script and lots of small scripts, one for each check
24 April 2009 © 2009 Data Management & Warehousing Page 11

Trending
•  Absolute Trending
–  Track an expected value over time
•  e.g. Returned Mail is usually less than 500 items per day •  If the value is <500 status is green, 501 to 1000 amber and >1000 red

•  Statistical Process Control (SPC) Trending
–  Track an expected value where the value changes over time
•  •  •  •  e.g. Telco CDRs – expect more as the company grows Don’t want to be continuously changing the threshold Compare current load to historical means If current load within 2 Standard Deviations – Green, 3 Standard Deviations – Amber, 4 or more Standard Deviations - Red
© 2009 Data Management & Warehousing Page 12

24 April 2009

Flow Control
•  Flow Control
–  ETL manipulates data
•  Joins, De-duplicates, Filters, Aggregates, etc

–  Use the formula:
Source Count - Filtered Count – DeDup Count – Target Count = 0

•  Trusted Source
–  Compare the result with a third system –  e.g. Does the Count of Switch CDRs = Count of those processed by the billing system Count of those processed in the DWH
24 April 2009 © 2009 Data Management & Warehousing Page 13

Business Rules Based
•  Specific rules to match known business rules
–  Account holders > 18 (Sys Date – DoB) –  Account holders < 115 –  Credit Card numbers are 16 digits long –  Number of accounts without a status

•  Result should yield Zero

24 April 2009

© 2009 Data Management & Warehousing

Page 14

Automated Checking - Ops
•  Managed by exception
–  Red given priority –  Amber are always followed up

•  Massive number of checks
–  100’s are good –  1000’s are better

•  Presentation
–  Alerts, RAG, Graphical, Numerical, etc.
24 April 2009 © 2009 Data Management & Warehousing Page 15

Data Profiling Checks
•  Run manually because they need to be interpreted by a human •  Leads to new business rules being added to the automated checks •  Can be done with simple reporting tools or commercial data profiling tools

24 April 2009

© 2009 Data Management & Warehousing

Page 16

Frequency Outliers
•  Count discreet values in a table and check items with many more or less than normal
–  e.g. DoB 01-01-01 many times more common than any other value indicates source default and something that needs work –  e.g. Count of SMS messages significantly lower on a given day may equate to a genuine system failure and therefore not a DQ problem
24 April 2009 © 2009 Data Management & Warehousing Page 17

Maximum & Minimum
•  Determine what a valid range for any value should be
–  e.g. age between 18 and 115 –  Immediately finds individual data quality issues that can be resolved –  Allows an analyst to create new business rules to prevent future problems

24 April 2009

© 2009 Data Management & Warehousing

Page 18

Sequential Keys
•  If a system has a sequential key: Max Value – Min Value – Count = 0 •  If this is true – is it too perfect for an operational system and therefore test data •  If this is false – what has caused the gaps, are the deletions intentional?
24 April 2009 © 2009 Data Management & Warehousing Page 19

Data Types
•  Validation of mis-used data types before loading
–  e.g. Dates in Character fields –  Format: YYYYMMDD –  Check MM between 01 and 12 –  Check DD between 01 and 31 –  Check MMDD does not include 0230, 0231 –  etc.
24 April 2009 © 2009 Data Management & Warehousing Page 20

Skewed Pattern Profiling
•  Looking for specific patterns in data
–  e.g. UK National Insurance Numbers (?) have the format AA 99 99 99 A –  Pattern match all values looking for exceptions

•  Number Lengths are a special case
–  e.g Credit Card Numbers are 16 digits long

24 April 2009

© 2009 Data Management & Warehousing

Page 21

Content Checking
•  Content Checking is the manual review of character strings •  Needs a good understanding of the nature of the data •  Often determines the need to do analysis of other types

24 April 2009

© 2009 Data Management & Warehousing

Page 22

Nulls & White Space
•  Nulls
–  Fields that have large proportion of nulls are usually not useful –  Also common is default status of null (e.g. Account us either closed or null)

•  White Space
–  Not Null fields with a single space –  Tab instead of space –  Leading/Trailing white space –  Double White Space: “DavidSpaceSpaceWalker”
24 April 2009 © 2009 Data Management & Warehousing Page 23

Punctuation & Control Chars
•  Punctuation
–  CSV files that are not properly quoted perform field shifts –  Address lines with extra commas

•  Control Characters
–  Data fields that contain ASCII character codes 0 to 31 and 127 to 159 are often ‘invisible’ when viewed in queries but cause failures –  Also be aware of ‘code-page’ specifics
24 April 2009 © 2009 Data Management & Warehousing Page 24

Problem Management Matrix

24 April 2009

© 2009 Data Management & Warehousing

Page 25

Continuous DQ Process

24 April 2009

© 2009 Data Management & Warehousing

Page 26

Quality is FREE …
… as long as you are prepared to INVEST HEAVILY in it Philip Crosby 1980 Especially true of Data Quality
24 April 2009 © 2009 Data Management & Warehousing Page 27

Data Management & Warehousing

Data Quality: Common Problems & Checks

Thank You
David M. Walker davidw@datamgmt.com +44 (0) 7050 028 911 - http://www.datamgmt.com