
Topic 3

Data Quality
Prepared By
Saidatul Rahah Hamidi
Introduction

 Today, most organizations use data in two ways:


 Transactional/operational use (“running the business”)
and,
 Analytic use (“improving the business”)
 Both usage scenarios rely on high quality information
 Thus, processes are needed to ensure that data is of sufficient quality to meet those needs
 Therefore, it is of great value to any enterprise to incorporate a data quality program
 A data quality program includes processes for assessing, measuring, reporting, reacting to, and controlling different aspects of the risks associated with poor data quality
Information Value and Data
Quality Improvement
 There are different ways of looking at information
value.
 The simplest approaches:
 Consider the cost of acquisition (i.e., the data is worth
what we paid for it)
 Its market value (i.e., what someone is willing to pay for
it)
 But in an environment where data is created, stored,
processed, exchanged, shared, aggregated, and reused,
perhaps the best approach for understanding
information value is its utility – the expected value to
be derived from the information (what we can get from
the information)
Data Quality

 Data quality is often defined as “fitness for use”, i.e. an evaluation of the extent to which the data serve the purposes of the user
 Another definition: “Data quality is about having confidence in the quality of the data that you record and the data you use”
 Data quality is divided into four dimensions: accuracy, timeliness, completeness, and consistency (Ballou and Pazer, 1985)

https://www.coursera.org/learn/big-data-machine-learning/lecture/eqLb8/data-quality
Data Quality

Dimensions contributing to data quality


Data Quality

 Accurate – refers to how closely the data captures what it is designed to capture
 E.g., each data field is defined so that it is clear what type of data is to be recorded; for example, DOB is in the format dd/mm/yy
 Complete – data that has all those items required to
measure intended activity or event
 Legible – data that the intended users will find easy to
read and understand
 Relevant – meets the need of the information users
 Reliable – data is collected consistently over time and
reflects the true facts
Data Quality

 Timely – data is collected within a reasonable agreed time period
 Valid – data is recorded in accordance with any rules
Data Quality Issues

 Missing values
 Duplicate data
 Noise
 Invalid Data
 Outliers
 https://www.coursera.org/learn/big-data-machine-learning/lecture/tp2m0/addressing-data-quality-issues
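Several of the issue types listed above can be detected with a few standard-library checks. The following is a minimal sketch, not a definitive method: the field names (`id`, `amount`), the sample rows, and the 1.5 × IQR outlier rule are all assumptions made for illustration.

```python
from statistics import quantiles

# Hypothetical sample records; None marks a missing value.
records = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": 12.0},
    {"id": 2, "amount": 12.0},   # exact duplicate of the previous row
    {"id": 3, "amount": None},   # missing value
    {"id": 4, "amount": 11.0},
    {"id": 5, "amount": 13.0},
    {"id": 6, "amount": 500.0},  # suspiciously large
]

def find_missing(rows, field):
    """Return indexes of rows where `field` is absent or None."""
    return [i for i, r in enumerate(rows) if r.get(field) is None]

def find_duplicates(rows):
    """Return indexes of rows that exactly repeat an earlier row."""
    seen, dupes = set(), []
    for i, r in enumerate(rows):
        key = tuple(sorted(r.items()))
        if key in seen:
            dupes.append(i)
        seen.add(key)
    return dupes

def find_outliers(values):
    """Flag values more than 1.5 * IQR outside the quartiles (a rule of thumb)."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < low or v > high]

amounts = [r["amount"] for r in records if r["amount"] is not None]
print(find_missing(records, "amount"))
print(find_duplicates(records))
print(find_outliers(amounts))
```

A flagged value is only a candidate for review; whether it is noise, invalid data, or a genuine extreme value still needs a human decision.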
Impacts of poor quality data

 Poor-quality data has negative effects on business users through:
 lower customer satisfaction
 increased running costs
 inefficient decision-making processes
 lower performance, and
 lower employee job satisfaction
 Poor data quality also increases operational costs, since time and other resources are spent detecting and correcting errors.
 A Gartner report stated that the average organization loses $8.2 million annually through poor-quality data; 22 percent of organizations estimated their annual losses at $20 million, and 4 percent reported losses of $100 million.
Causes of Poor Data Quality
 Manual data entry
 People mistype. They choose the wrong entry from a list. They
enter the right data value into the wrong box.
 Given complete freedom on a data field, those who enter data
have to go from memory. Is the vendor named Grainger, WW
Granger, or W. W. Grainger?
 Information obfuscation (not clear info)
 If a field is not available, an alternate field is often used. This
can lead to such data quality issues as having Tax ID numbers
in the name field or contact information in the comments
field.
 After the Merger
 Mergers usually happen fast and are unforeseen by IT departments.
 Mergers can result in a loss of expertise when key people leave
midway through the project to seek new ventures.

http://docs.media.bitpipe.com/io_25x/io_25186/item_384743/Top%2010%20Root%20Causes%20of%20Data%20Quality%20Problems-%20wp_en_dq_top_10_dq_problems.pdf
Solutions
 Monitoring
 Make public the results of poorly entered data and praise those who enter data
correctly.
 Real-time Validation
 In addition to form validation, data quality tools can be implemented to validate addresses, e-mail addresses and other important information as it is entered.
 Communication
 Regular communication and a well-documented metadata model will make the
process of change much easier.
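Real-time validation can be pictured as a check that runs on each submission before it is stored. The sketch below is illustrative only: the field names, the error messages, and the deliberately simple e-mail pattern are assumptions, and a production system would typically rely on a dedicated data quality tool.

```python
import re

# Deliberately simple pattern: something@something.tld. Real e-mail syntax is
# broader, so treat this as a first-line check, not a complete validator.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_entry(entry):
    """Return a list of problems found in one form submission (empty if none)."""
    problems = []
    if not entry.get("name", "").strip():
        problems.append("name is required")
    if not EMAIL_RE.match(entry.get("email", "")):
        problems.append("email is not well formed")
    return problems

good = validate_entry({"name": "W. W. Grainger", "email": "sales@example.com"})
bad = validate_entry({"name": " ", "email": "sales.example.com"})
print(good)
print(bad)
```

Rejecting or flagging the entry at this point keeps the error out of downstream systems, which is cheaper than correcting it later.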
Root Cause
Analysis
What is Root Cause Analysis

 A process of determining the causes that led to a nonconformance, event or undesirable condition, and of identifying corrective actions that prevent recurrence and, once applied, restore the status quo or establish a desired effect.
Purpose
 Root Cause Analysis helps to identify what, how, and why
something happened, thus preventing recurrence.
 Root causes are underlying, are reasonably identifiable,
can be controlled by management and allow for the
generation of recommendations.
 The process involves data collection, cause charting, root
cause identification, recommendation generation and
implementation.
 Only when you are able to determine why an event or
failure occurred will you be able to specify workable
corrective measures.
Root Cause Analysis 14
Understanding Root Causes
 To fix a problem it must be clearly defined. In a lot
of cases the symptom is identified and not the
underlying problem.
 For example, buying expired milk is not an inspection failure; it is a recall system failure.

 Questions to ask are:


 What is the scope of the problem?
 What else is affected by the problem?
 How often does it occur?
 What impact will this have on the larger population?



Determining Root Causes
Four steps you can use to identify the Root Cause
 Data Collection & Prioritization
 Pareto Analysis
 Cause Charting
 Cause and Effect Diagram (Fishbone)
 Root Cause Identification
 Recommendation Generation and Implementation
http://nexus.som.yale.edu/ph-tanzania/?q=node/131



Data Collection
 Data collection provides information and an understanding of causal
factors.
 Good data collection techniques involve:
 Data Types – Attribute or Discrete
 Good/Bad, Counts or Percentages
 (http://blog.minitab.com/blog/understanding-statistics/understanding-qualitative-quantitative-attribute-discrete-and-continuous-data-types)

 Planning – When, Who, How, Stratification


 Check Sheets - Consistency of Data Collection
 Measurement System Analysis
 Ensure the data collection process is “Repeatable and Reproducible”

http://www.six-sigma-material.com/Data-Classification.html
Pareto Chart
 A Pareto chart is a graphical tool to prioritize multiple problems in a process
so you can focus on areas where the largest opportunities exist.
 Pareto charts are a type of bar chart in which the horizontal axis represents
categories of interest.
 By ordering the bars from largest to smallest, a Pareto chart can help you
determine which of the defects comprise the “vital few”, and which are the
“trivial many.”
 The Pareto principle states that 80% of the effect is generated by 20% of the
causes. We want to focus on the 20%.
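The ordering and cumulative percentages behind a Pareto chart can be computed directly. This is a minimal sketch: the six counts are the ones shown in this deck's sample chart of processing errors, while the category labels `A` to `F` and the 80% cut-off parameter are placeholders chosen for illustration.

```python
def pareto_table(counts):
    """Sort categories by count (descending) and add percent and cumulative percent."""
    total = sum(counts.values())
    rows, running = [], 0
    for name, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        running += count
        rows.append((name, count,
                     round(100 * count / total, 1),
                     round(100 * running / total, 1)))
    return rows

def vital_few(rows, threshold=80.0):
    """Return the leading categories needed to reach the cumulative threshold."""
    selected = []
    for name, _, _, cumulative in rows:
        selected.append(name)
        if cumulative >= threshold:
            break
    return selected

# Counts from the sample chart; the labels are placeholders.
counts = {"A": 73, "B": 18, "C": 13, "D": 8, "E": 7, "F": 5}
rows = pareto_table(counts)
for row in rows:
    print(row)
print(vital_few(rows))
```

With these counts the first three categories already account for more than 80% of the errors, so they form the "vital few" to investigate first.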



Sample Pareto Chart: Processing Errors
[Figure: Pareto chart of processing errors. Left axis: Count; right axis: Percent.]
Exception HHG TQ/TA GHS AT New Res Other
Count 73 18 13 8 7 5
Percent 58.9 14.5 10.5 6.5 5.6 4.0
Cum % 58.9 73.4 83.9 90.3 96.0 100.0
Cause and Effect Diagram
(Also Called Fishbone)
 What
 A tool to represent the relationship between an effect
(problem) and its potential causes by category type.
 When
 Carried out when a root cause needs to be determined.
 Why
 To help ensure that a balanced list of ideas has been generated during brainstorming.
 To determine the real cause of the problem versus a symptom.
 To refine brainstormed ideas into more detailed causes.
Example: Fishbone Diagram
[Figure: fishbone diagram. Effect: Too many price adjustments at check-out.]
 Categories: Material, Machine, Methods, Mother Nature, Measurements, Manpower
 Labels shown on the diagram:
 Discovery of different discount rates occurs too late in process
 Computer screens
 Too many “jumps”
 Billing process not accurate
 Updates
 Product shortages
 Master customer discount table not up-to-date
 Incomplete training on common complaints
 Power failures
 Management policies
 Not enough staffing during peak times
 Marketing metrics counterproductive
 Unfamiliarity with procedures
 Notification of absence
 For vacation notification
Root Cause Identification
 Asking the right questions will help address the
actual problem and not the symptoms.
 Types of questions to ask:
 What is the scope of the problem?
 How many problems are there?
 What is affected by the problem?
 How often does the problem occur?



Root Cause Identification
Tools used to assist with Root Cause Identification:
 Data Analysis
 Pareto Charts
 Fishbone Diagrams
 5 Whys Technique
 Brainstorming
 Affinity Diagrams
http://s3-euw1-ap-pe-ws4-cws-documents.ri-prod.s3.amazonaws.com/9781138889255/Appendix_A.pdf

http://slideplayer.com/slide/217791/
Root Cause Identification
 Reduce the list of potential root causes
 Rank root causes using Pareto Analysis
(Statistical)
 Rank the items in order of significance
(Organizational)
 Identify the items with the most significant
impact
 Time
 Cost
 Manpower
Root Cause Identification
 Confirm potential root causes relate to the
overall problem
 Validate/Verify that root causes identified
have a causal relationship with the desired
output
 Ensure the legitimacy of the measurement
system
 Ensure results are repeatable and reproducible

Note: If you cannot state the problem simply, you do not fully understand the problem.


Addressing the Root Cause(s)

 Conduct Value Add Analysis


 Ensure that items identified will
add value to the organization or
customer
 Ensure that the items are
required by regulation or policy
 Confirm that the item does not
add value and is not needed or
required
Recommendation
Implementation
 Things to consider prior to implementation:
 Determine the impact the root causes will
have on critical inputs (X)
 Estimate the impact of the root cause on the overall output (Y)



Recommendation
Implementation (Management)
 Implement recommendations based on:
 Significance to organizational goals and
objectives
 Availability of personnel, finances or other
essential resources
 Complexity of the implementation
 Evaluate controls required to maintain corrective
actions after implementation.
Definition of the 5 Whys

 The 5 Whys is an iterative question-asking technique used to explore the cause-and-effect relationships underlying a particular problem.
 The primary goal of the technique is to
determine the root cause of a defect or
problem. (The "5" in the name derives
from an empirical observation on the
number of iterations typically required to
resolve the problem.)



Benefits of the 5 Whys

 Help identify the root cause of a problem.


 Determine if there is a relationship
between different root causes of a
problem.
 Simplicity; easy to complete without
statistical analysis.
 Effective when problems involve human
factors or interactions.



Table Top Exercise
 Problem Statement 1: You have to spend more and
more money on your utility bills.
 Problem Statement 3: You frequently arrive to work
late in the mornings and you are faced with disciplinary
action if you don’t correct it immediately.
 Problem Statement 4: You do not have enough money
to retire comfortably.



Data Quality
Management Plan
The Data Quality Management
Process

 The process of data quality management is composed of four main steps, which can be organized in a continuous loop, as shown in the following figure.
Data Quality Management
[Figure: continuous loop of four steps: Data Definition, Data Quality Assessment, Problem Resolution, Data Quality Monitoring]
Data Definition

 In this step the data describing the business of the undertaking must be appropriate and complete. The definition of the data involves the identification of data requirements that fulfill this criterion. Data requirements should contain a proper description of the single items and their relationship.
Data Quality Assessment

 Data quality assessment involves validating the data according to three criteria: appropriateness, completeness, and accuracy. The assessment should consider the channel through which data is collected and processed, whether through internal systems, external third parties, or publicly available electronic sources.
Problem resolution

 The problems that are identified during the assessment of the data quality are addressed in this phase. It is important to document data limitations and justify the remedies applied to poor data.
Data Quality Monitoring

 Data quality monitoring involves monitoring the performance of the associated IT systems, based on data quality performance indicators. Data quality monitoring involves two dimensions: quantitative and qualitative.
Data Quality Tools

 Data quality comprises much more than technology; it also includes roles and organizational structures, processes for monitoring, measuring and remediating data quality issues, and links to broader information governance activities via data-quality-specific policies.
 Examples of tool functions:
 Profiling
 Parsing and standardization
 Generalized "cleansing"
 Matching
 Monitoring
Data Quality Tools
 Profiling - The analysis of data to capture statistics
(metadata) that provide insight into the quality of data
and help to identify data quality issues.
 Discover metadata of the source database, including value
patterns and distributions, key candidates, foreign-key
candidates, and functional dependencies
 Parsing - The decomposition of text fields into
component parts and the formatting of values into
consistent layouts based on industry standards, local
standards (for example, postal authority standards for
address data), user-defined business rules, and
knowledge bases of values and patterns.
 Parsing: breaking a data block into smaller chunks by following a set of rules, so that it can be more easily interpreted, managed, or transmitted
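Profiling as described above can be sketched for a single column: count the values, the missing entries and the distinct values, check whether the column could serve as a key, and summarize value patterns. The column names, sample rows, and the pattern alphabet (`A` for letters, `9` for digits) are assumptions for illustration, not the behaviour of any particular profiling tool.

```python
from collections import Counter
import re

def value_pattern(value):
    """Abstract a value into a pattern: letters become A, digits become 9."""
    text = re.sub(r"[A-Za-z]", "A", str(value))
    return re.sub(r"[0-9]", "9", text)

def profile_column(rows, column):
    """Collect simple statistics (metadata) about one column."""
    values = [r.get(column) for r in rows]
    present = [v for v in values if v not in (None, "")]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        # A key candidate has no missing values and no repeated values.
        "is_key_candidate": len(present) == len(values)
                            and len(set(present)) == len(values),
        "patterns": Counter(value_pattern(v) for v in present),
    }

# Hypothetical source rows.
rows = [
    {"customer_id": "C001", "postcode": "40450"},
    {"customer_id": "C002", "postcode": "4045"},
    {"customer_id": "C003", "postcode": ""},
]
id_profile = profile_column(rows, "customer_id")
postcode_profile = profile_column(rows, "postcode")
print(id_profile)
print(postcode_profile)
```

Here the two different postcode patterns and the one missing value are the kind of signal that profiling surfaces before any cleansing starts.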

Read more:
http://www.businessdictionary.com/definition/parsing.html
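Parsing and standardization can be illustrated with a date field: the text is decomposed into day, month and year, then re-emitted in one consistent layout. This is a sketch under stated assumptions: the list of accepted input layouts and the choice of ISO 8601 (`YYYY-MM-DD`) as the target layout are illustrative, not rules from the source.

```python
from datetime import datetime

# Accepted input layouts (an assumption for this sketch).
KNOWN_FORMATS = ("%d/%m/%Y", "%d-%m-%Y", "%Y-%m-%d")

def parse_date(text):
    """Split a date string into parts; return None if no known layout fits."""
    for fmt in KNOWN_FORMATS:
        try:
            parsed = datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
        return {"day": parsed.day, "month": parsed.month, "year": parsed.year}
    return None

def standardize_date(text):
    """Re-emit a parsed date as YYYY-MM-DD, or None if it could not be parsed."""
    parts = parse_date(text)
    if parts is None:
        return None
    return f"{parts['year']:04d}-{parts['month']:02d}-{parts['day']:02d}"

for raw in ("07/03/1985", "07-03-1985", "1985-03-07", "31/02/1985"):
    print(raw, "->", standardize_date(raw))
```

Values that fit no known layout, or that name an impossible calendar date, come back as `None` so they can be routed to manual review instead of being silently guessed.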
Data Quality Tools

 Generalized "cleansing" – The modification of data values to meet domain restrictions, integrity constraints or other business rules that define when the quality of data is sufficient for an organization.
 Matching - Identifying, linking or merging related
entries within or across sets of data.
 Monitoring - Deploying controls to ensure that data
continues to conform to business rules that define data
quality for the organization.
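Matching can be sketched with approximate string comparison, reusing the vendor-name variants mentioned earlier in this deck (Grainger, WW Granger, W. W. Grainger). The normalization step, the greedy grouping strategy, the 0.8 threshold, and the extra name "Acme Supply" are all assumptions for illustration; commercial matching tools use considerably richer rules.

```python
from difflib import SequenceMatcher
import re

def normalize(name):
    """Lower-case and drop punctuation and spaces so formatting differences vanish."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

def similarity(a, b):
    """Similarity ratio between 0 and 1 on the normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def match_groups(names, threshold=0.8):
    """Greedily group names whose similarity to a group's first member meets the threshold."""
    groups = []
    for name in names:
        for group in groups:
            if similarity(name, group[0]) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

names = ["Grainger", "WW Granger", "W. W. Grainger", "Acme Supply"]
groups = match_groups(names)
print(groups)
```

The threshold is a trade-off: lowering it links more variants but also raises the risk of merging records that belong to different entities.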
Data Quality Tools

 Enrichment - Enhancing the value of internally held data by appending related attributes from external sources (for example, consumer demographic attributes and geographic descriptors)
Data Cleaning

 No matter how efficient the process of data entry, errors will still occur and therefore data validation and correction cannot be ignored.
 Purpose
 To detect and fix errors
 To identify the basic causes of errors, and use that information to improve the data entry process
Data Cleaning

 The process may include:


 format checks
 completeness checks
 reasonableness checks
 limit checks
 review of the data to identify outliers (geographic, statistical,
temporal or environmental) or other errors,
 assessment of data by subject area experts (e.g. taxonomic
specialists)
 filling in missing values
 smoothing noisy data
 identifying or removing outliers, and
 resolving inconsistencies
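The first four checks in the list above (format, completeness, reasonableness, limit) can be expressed as rules applied to each record. The sketch below assumes a hypothetical schema with `name`, `dob` and `quantity` fields, an ISO date layout, and a 1..1000 quantity range; none of these come from the source.

```python
import re
from datetime import date

REQUIRED_FIELDS = ("name", "dob", "quantity")  # hypothetical schema

def check_record(record, today):
    """Run completeness, format, reasonableness and limit checks on one record."""
    errors = []
    # Completeness check: every required field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            errors.append(f"completeness: {field} is missing")
    dob = record.get("dob") or ""
    if dob:
        # Format check: date of birth must look like YYYY-MM-DD.
        if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", dob):
            errors.append("format: dob is not YYYY-MM-DD")
        else:
            try:
                born = date.fromisoformat(dob)
            except ValueError:
                errors.append("format: dob is not a real calendar date")
            else:
                # Reasonableness check: a birth date cannot lie in the future.
                if born > today:
                    errors.append("reasonableness: dob is in the future")
    # Limit check: quantity must fall inside an agreed range.
    quantity = record.get("quantity")
    if isinstance(quantity, (int, float)) and not 1 <= quantity <= 1000:
        errors.append("limit: quantity outside 1..1000")
    return errors

today = date(2024, 1, 1)  # fixed reference date so the example is repeatable
print(check_record({"name": "Ali", "dob": "1990-05-17", "quantity": 5}, today))
print(check_record({"name": "", "dob": "17/05/1990", "quantity": 5000}, today))
```

Returning every failed rule, instead of stopping at the first, supports the "document error instances and error types" step of a cleaning framework.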
Data Cleaning

 Data cleaning framework (Maletic & Marcus, 2000)


 Define and determine error types
 Search and identify error instances;
 Correct the errors;
 Document error instances and error types; and
 Modify data entry procedures to reduce future errors.
Data Cleaning

 Four methods:
 Correct
 Filter
 Detect and Report
 Prevent
 What are the tools that can be used for data cleaning?
 Discuss the issues related to data cleaning
Data Quality Assessment

 Purpose - to identify the quality of the data in the identified business activity
 The assessment results determine the accuracy,
completeness, consistency, precision, reliability,
temporal reliability, uniqueness and validity of the data
 What are the metrics/indicator/criteria used to
measure data quality?
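As one possible starting point for the question above, three of the listed properties (completeness, uniqueness, validity) can be expressed as simple ratios. The definitions below are common conventions rather than the only ones in use, and the field names, sample rows, and the "contains @" validity rule are assumptions for illustration.

```python
def _present(rows, field):
    """Non-empty values of `field` across all rows."""
    return [r[field] for r in rows if r.get(field) not in (None, "")]

def completeness(rows, field):
    """Share of rows in which `field` is present and non-empty."""
    return len(_present(rows, field)) / len(rows)

def uniqueness(rows, field):
    """Share of non-empty values of `field` that are distinct."""
    values = _present(rows, field)
    return len(set(values)) / len(values) if values else 1.0

def validity(rows, field, rule):
    """Share of non-empty values of `field` that satisfy `rule`."""
    values = _present(rows, field)
    return sum(1 for v in values if rule(v)) / len(values) if values else 1.0

# Hypothetical sample data.
rows = [
    {"id": "1", "email": "a@example.com"},
    {"id": "2", "email": ""},
    {"id": "2", "email": "not-an-email"},
    {"id": "4", "email": "d@example.com"},
]
print(completeness(rows, "email"))
print(uniqueness(rows, "id"))
print(validity(rows, "email", lambda v: "@" in v))
```

Tracking such ratios over time gives the quantitative side of data quality monitoring; the thresholds that count as acceptable remain a business decision.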
Sources

• Evaluating the Business Impacts of Poor Data Quality. David Loshin, President, Knowledge Integrity, Inc.
• Top 10 Root Causes of Data Quality Problems. Talend White Paper. Talend Open Integration Solutions.
• Institute of Internal Auditors (IIA). Terry Upshur, Director of Support Services with the Inspector General of the U.S. House of Representatives.
• Meeting the Data Quality Management Challenges of Solvency. White Paper. Neri Massimiliano, Associate Director, Moody’s Analytics. May 2011.
• Magic Quadrant for Data Quality Tools 2012. Ted Friedman. Gartner.
