Data Quality
Prepared By
Saidatul Rahah Hamidi
Introduction
https://www.coursera.org/learn/big-data-machine-learning/lecture/eqLb8/data-quality
Data Quality
Missing values
Duplicate data
Noise
Invalid Data
Outliers
https://www.coursera.org/learn/big-data-machine-learning/lecture/tp2m0/addressing-data-quality-issues
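Several of the issue types listed above can be checked programmatically. The sketch below is only an illustration: the sample records, the 0 to 120 age range, and the use of the interquartile-range rule for outliers are all assumptions, not something the slides prescribe. Noise is left out because detecting it usually needs domain knowledge.

```python
from statistics import quantiles

# Hypothetical sample records; None marks a missing value.
records = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},   # missing value
    {"id": 3, "age": 29},
    {"id": 3, "age": 29},     # duplicate of the previous record
    {"id": 4, "age": -5},     # invalid: an age cannot be negative
    {"id": 5, "age": 31},
    {"id": 6, "age": 98},     # valid, but far from the other values
    {"id": 7, "age": 33},
    {"id": 8, "age": 30},
]

def find_missing(rows):
    """Return ids of records with at least one missing field."""
    return [r["id"] for r in rows if any(v is None for v in r.values())]

def find_duplicates(rows):
    """Return ids of records that repeat an earlier record exactly."""
    seen, dups = set(), []
    for r in rows:
        key = tuple(sorted(r.items()))
        if key in seen:
            dups.append(r["id"])
        seen.add(key)
    return dups

def find_invalid(rows, low=0, high=120):
    """Return ids whose age falls outside an assumed valid range."""
    return [r["id"] for r in rows
            if r["age"] is not None and not (low <= r["age"] <= high)]

def find_outliers(values):
    """Return values outside 1.5 * IQR of the quartiles (one common rule)."""
    q1, _, q3 = quantiles(values, n=4)
    spread = 1.5 * (q3 - q1)
    return [v for v in values if v < q1 - spread or v > q3 + spread]

ages = [r["age"] for r in records if r["age"] is not None]
print("missing:", find_missing(records))
print("duplicates:", find_duplicates(records))
print("invalid:", find_invalid(records))
print("outliers:", find_outliers(ages))
```

Note that a single value can fall into more than one category: the negative age is both invalid and a statistical outlier.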
Impacts of poor quality data
http://docs.media.bitpipe.com/io_25x/io_25186/item_384743/Top%2010%20Root%20Causes%20of%20Data%20Quality%20Problems-%20wp_en_dq_top_10_dq_problems.pdf
Solutions
Monitoring
Make public the results of poorly entered data and praise those who enter data
correctly.
Real-time Validation
In addition to checks built into entry forms, data quality validation tools can
be implemented to validate addresses, e-mail addresses and other important
information as it is entered.
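A minimal sketch of validation at entry time, assuming a hypothetical record with "email" and "address" fields. The e-mail pattern is a deliberately simplified format check, not a full standards-compliant validator, and a real address check would normally call a dedicated address-verification tool.

```python
import re

# Simplified pattern: something@something.something, no spaces.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_entry(entry):
    """Return a list of error messages; an empty list means the entry passed."""
    errors = []
    if not EMAIL_PATTERN.match(entry.get("email", "")):
        errors.append("email: not a valid address format")
    if not entry.get("address", "").strip():
        errors.append("address: must not be empty")
    return errors

# The entry is checked before it is stored, so bad values never reach the database.
print(validate_entry({"email": "user@example.com", "address": "1 Main Road"}))
print(validate_entry({"email": "user.example.com", "address": " "}))
```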
Communication
Regular communication and a well-documented metadata model will make the
process of change much easier.
Root Cause Analysis
What is Root Cause Analysis
http://www.six-sigma-material.com/Data-Classification.html
Pareto Chart
A Pareto chart is a graphical tool to prioritize multiple problems in a process
so you can focus on areas where the largest opportunities exist.
Pareto charts are a type of bar chart in which the horizontal axis represents
categories of interest.
By ordering the bars from largest to smallest, a Pareto chart can help you
determine which of the defects comprise the “vital few”, and which are the
“trivial many.”
The Pareto principle states that roughly 80% of the effect is generated by 20%
of the causes. We want to focus on that 20%.
[Figure: Pareto chart of exceptions by category; axes: Count and Percent]
Exception   HHG    TQ/TA   GHS    AT     New Res   Other
Count       73     18      13     8      7         5
Percent     58.9   14.5    10.5   6.5    5.6       4.0
Cum %       58.9   73.4    83.9   90.3   96.0      100.0
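The Percent and Cum % rows of a Pareto table can be derived from the counts alone. The sketch below uses the exception counts from the chart above; the function names and the 80% cut-off for the "vital few" are illustrative assumptions.

```python
# Exception counts taken from the Pareto chart data.
counts = {"HHG": 73, "TQ/TA": 18, "GHS": 13, "AT": 8, "New Res": 7, "Other": 5}

def pareto_table(counts):
    """Return (category, count, percent, cumulative percent) rows, largest first."""
    total = sum(counts.values())
    rows, running = [], 0
    for name, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        running += count
        rows.append((name, count,
                     round(100 * count / total, 1),
                     round(100 * running / total, 1)))
    return rows

def vital_few(rows, threshold=80.0):
    """Return the leading categories needed to reach the cumulative threshold."""
    selected = []
    for name, _, _, cumulative in rows:
        selected.append(name)
        if cumulative >= threshold:
            break
    return selected

table = pareto_table(counts)
for row in table:
    print(row)
print("vital few:", vital_few(table))
```

With these counts, the first three categories already account for more than 80% of all exceptions, so they are the natural place to start.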
Cause and Effect Diagram
(Also Called Fishbone)
What
A tool to represent the relationship between an effect
(problem) and its potential causes by category type.
When
Carried out when a root cause needs to be determined.
Why
To help ensure that a balanced list of ideas has been
generated during brainstorming.
To determine the real cause of
the problem versus a symptom.
To refine brainstormed ideas into
more detailed causes.
Example: Fishbone Diagram
[Figure: fishbone diagram. Recoverable labels: Material, Machine, Methods;
"Discovery of different discount rates occurs too late in process";
"Computer screens"]
http://slideplayer.com/slide/217791/
Root Cause Identification
Reduce the list of potential root causes
Rank root causes using Pareto Analysis
(Statistical)
Rank the items in order of significance
(Organizational)
Identify the items with the most significant
impact
Time
Cost
Manpower
Root Cause Identification
Confirm potential root causes relate to the
overall problem
Validate/Verify that root causes identified
have a causal relationship with the desired
output
Ensure the legitimacy of the measurement
system
Ensure results are repeatable and reproducible
[Figure: diagram with four stages: Data Definition, Data Quality Assessment,
Problem Resolution, Data Quality Monitoring]
Data Definition
Read more: http://www.businessdictionary.com/definition/parsing.html
Data Quality Tools
Four methods:
Correct
Filter
Detect and Report
Prevent
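The four methods can be contrasted on a single defect. The sketch below is one possible reading of each method, assuming hypothetical records with an "id" and an "email" field. The function names and the choice of defect are illustrative and do not come from any specific data quality tool.

```python
def correct(record):
    """Correct: repair a fixable defect (here, stray whitespace and letter case)."""
    fixed = dict(record)
    if fixed.get("email"):
        fixed["email"] = fixed["email"].strip().lower()
    return fixed

def filter_records(records):
    """Filter: drop records that cannot be used (here, those with no e-mail)."""
    return [r for r in records if r.get("email")]

def detect_and_report(records):
    """Detect and report: leave the data untouched and list the problems found."""
    return [f"record {r['id']}: missing email" for r in records if not r.get("email")]

def prevent(record):
    """Prevent: refuse bad data at the point of entry."""
    if not record.get("email"):
        raise ValueError("email is required")
    return record

# Hypothetical input data.
records = [
    {"id": 1, "email": "  Alice@Example.COM "},
    {"id": 2, "email": None},
]

print(correct(records[0]))
print(filter_records(records))
print(detect_and_report(records))
```

Correcting and filtering act on data that is already stored, detecting and reporting leaves the decision to a person, and preventing stops the defect before it enters the system.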
What are the tools that can be used for data cleaning?
Discuss the issues related to data cleaning.
Data Quality Assessment