
3

DATA QUALITY MANAGEMENT

LEARNING OUTCOMES
At the end of this topic, you should be able to:
• Explain the data quality dimensions of a data quality framework
• Perform data profiling on a data set

INTRODUCTION
Data quality is one of the most important problems in data management. Real-life data are often dirty:
inconsistent, duplicated, inaccurate, or incomplete. Dirty data can arise from errors made when merging
data sets from different sources, broken rules, data entry errors, inconsistent data sources, and
data transmission errors. If these errors are not corrected, the resulting poor-quality data will lead to
poor-quality decisions and results.

Data quality refers to a measure, or set of measures, that gives an organization an indication of the level
of confidence it can have in the data used in its operational and strategic decision-making
processes. Data quality management is the set of processes by which we manipulate the organization's data
to increase its quality. In this topic, you will learn about the key data quality dimensions in the Data
Management Association (DAMA) framework and the data profiling process.

1.1 DATA QUALITY DIMENSIONS


Data quality dimensions are the characteristics of data that can be used to assess data quality and identify
data quality issues. The Data Management Association (DAMA) [1] defined six key characteristics that
can be used to assess data quality, as shown in Figure 1:

Completeness
• Definition: Proportion of stored data against the potential of "100%" complete.
• Measure: A measure of the presence of non-blank values (the absence of missing values).
• E.g. 98% of the data contains the first emergency contact phone number.

Uniqueness
• Definition: No thing is recorded more than once.
• Measure: Analysis of the number of things as assessed in the real world compared to the number of things in the data set.
• E.g. Out of the 520 student records, 96.2% of the data is recorded only once.

Timeliness
• Definition: Data represents reality from the required point of time.
• Measure: Measure the time difference.
• E.g. An emergency contact change which is effective on June 1st is entered into the system on June 4th, a delay of 3 days.

Validity
• Definition: Data conforms to the syntax (format, type, range) of its definition.
• Measure: Comparison between the data and the metadata for the data item.
• E.g. Specific ruling for the middle name value in the case of a person without a middle name.

Accuracy
• Definition: Data represents real-world values.
• Measure: Degree to which the data mirrors the objects it represents in the real world.
• E.g. Incorrect date of birth as a result of differing date formats.

Consistency
• Definition: The absence of difference when comparing two or more representations of a thing.
• Measure: Analysis of pattern and/or value frequency.
• E.g. A student's date of birth must have the same value and format in the school register as that stored in the Student database.

Figure 1. Data Quality Dimensions.
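
To make the measures in Figure 1 concrete, the following is a minimal Python sketch of how the completeness, uniqueness, and timeliness measures could be computed. The function names, the sample phone numbers, and the year 2024 in the timeliness example are assumptions made for illustration; they do not come from the DAMA paper.

```python
from collections import Counter
from datetime import date


def completeness(values):
    """Percentage of values that are neither None nor blank."""
    if not values:
        raise ValueError("values must not be empty")
    filled = [v for v in values if v is not None and str(v).strip() != ""]
    return 100.0 * len(filled) / len(values)


def uniqueness(values):
    """Percentage of values that are recorded exactly once."""
    if not values:
        raise ValueError("values must not be empty")
    counts = Counter(values)
    once = sum(1 for v in values if counts[v] == 1)
    return 100.0 * once / len(values)


def timeliness_delay(effective, entered):
    """Number of days between the effective date and the entry date."""
    return (entered - effective).days


# Hypothetical sample data: two of five phone numbers are missing.
phones = ["555-0101", "", "555-0103", None, "555-0105"]
print(completeness(phones))

# The year is an assumption; Figure 1 only gives June 1st and June 4th.
print(timeliness_delay(date(2024, 6, 1), date(2024, 6, 4)))  # 3
```

Each function returns a single number, so the result can be compared directly against a target such as the 98% completeness figure quoted in Figure 1.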

A data variable must conform to one or more of the data quality dimensions in Figure 1, but not
necessarily all of them. For example, for the date of birth we can define the following data quality rules:
• Validity: Birth date must be a valid date in the range from 1900 to the current date.
• Completeness: Birth date must be entered for each individual; empty fields are not allowed.
From the example, we can see that each data quality rule is associated with a particular data quality
dimension. Multiple data quality rules can be associated with one data quality dimension.
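
The two birth date rules above can be sketched as a small Python check. This is only an illustration: the function name check_birth_date is hypothetical, and the M/D/YYYY input format is an assumption, since the rules themselves do not fix a date format.

```python
from datetime import date, datetime


def check_birth_date(raw, today=None):
    """Return a list of rule violations for one birth date field."""
    today = today or date.today()

    # Completeness rule: the field must not be empty.
    if raw is None or raw.strip() == "":
        return ["completeness: birth date is missing"]

    # Validity rule, part 1: the value must be a real calendar date.
    # The M/D/YYYY format is an assumption for this sketch.
    try:
        parsed = datetime.strptime(raw.strip(), "%m/%d/%Y").date()
    except ValueError:
        return ["validity: not a valid date in M/D/YYYY format"]

    # Validity rule, part 2: the date must lie between 1900 and today.
    if not (date(1900, 1, 1) <= parsed <= today):
        return ["validity: outside the range 1900 to current date"]

    return []


print(check_birth_date("7/24/1973"))  # no violations: []
print(check_birth_date("2/30/1968"))  # February 30th does not exist
print(check_birth_date(""))           # missing value
```

Returning a list of violations, rather than a single true/false value, keeps each failure tied to its data quality dimension.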

**See video: The Six Data Quality Dimensions with examples
https://www.youtube.com/watch?v=wmj6Iw938_8

SELF-LEARNING ACTIVITY
The DAMA framework is one of the common data quality frameworks used for data quality management. There
are also other data quality frameworks, such as ISO 8000, IBM, and GS1.
Compare the DAMA, ISO 8000, IBM, and GS1 frameworks in terms of their data quality dimensions.
Write your answers in the forum.

1.2 DATA PROFILING


Data profiling is the process of reviewing source data, understanding its structure, content and
interrelationships, and identifying its potential for data projects. It helps us discover
data quality problems and gives insight for data quality assessment. Data profiling is the very first
step in a data quality process.

The activities in data profiling include collecting descriptive statistics (e.g., min, max, count, sum),
collecting data types, lengths and recurring patterns, tagging data with keywords, descriptions or
categories, and performing data quality assessment.
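
As an illustration of these activities, the sketch below collects descriptive statistics for a numeric column and the lengths and recurring patterns of a text column. All function names and sample values are hypothetical, and writing 'A' for a letter and '9' for a digit is just one possible pattern notation chosen for this example.

```python
import re
from collections import Counter


def value_pattern(text):
    """Replace every letter with 'A' and every digit with '9'."""
    return re.sub(r"[0-9]", "9", re.sub(r"[A-Za-z]", "A", text))


def profile_text_column(values):
    """Collect count, blanks, length range, and pattern frequencies."""
    non_blank = [v for v in values if v.strip() != ""]
    lengths = [len(v) for v in non_blank]
    return {
        "count": len(values),
        "blanks": len(values) - len(non_blank),
        "min_length": min(lengths, default=0),
        "max_length": max(lengths, default=0),
        "patterns": Counter(value_pattern(v) for v in non_blank),
    }


def numeric_stats(numbers):
    """Descriptive statistics named in the text: min, max, count, sum."""
    return {
        "min": min(numbers),
        "max": max(numbers),
        "count": len(numbers),
        "sum": sum(numbers),
    }


# Hypothetical sample columns.
sample_dates = ["10/1/1968", "12/3/1990", "7/24/1973", ""]
sample_ids = [1, 2, 3, 4]

print(profile_text_column(sample_dates))
print(numeric_stats(sample_ids))
```

A pattern that occurs only rarely in the frequency count is often a good starting point for spotting values that break the expected format.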

Data profiling allows you to answer the following questions about your data:
• Does the available data represent a complete picture of the data that should be present?
• Does the data conform to the structure you would expect when you observe it?
• If you have the same data in two different systems, do they hold the same values?
• If there are properties of the data that should be unique, does the data set show that?
• Is the data accurate?
These are some of the questions that arise when we perform data quality assessment
based on the key dimensions explained in Section 1.1.

To understand data profiling, we are going to look at an example of Employee data as follows:

Table 1. Employee Data.

ID   Employee Full Name    Date of Birth
1    Michael Stanton       10/1/1968
2    Joe Irvine            12/3/1990
3    Jennifer Cipriani     7/24/1973
4    Salvatore Mendini     2/30/1968
5    Eva Carlos
6    Courtney O’Brien      12/12/1981
7    Frank Damon           9/18/1983
8    Katherine La Sal      1/22/1991

The Employee data in Table 1 can be profiled as follows:

Table 2. Data Profiling Results.

Category Result
Number of records 8
Number of Unique Values 7
Number of Blanks 1
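
The figures in Table 2 can be reproduced with a few lines of Python. This sketch assumes that Table 2 profiles the Date of Birth column of Table 1 and that blank values are left out of the unique-value count; neither assumption is stated in the table itself.

```python
# Date of Birth column from Table 1; Eva Carlos has no date recorded.
birth_dates = [
    "10/1/1968",
    "12/3/1990",
    "7/24/1973",
    "2/30/1968",
    "",
    "12/12/1981",
    "9/18/1983",
    "1/22/1991",
]

records = len(birth_dates)
blanks = sum(1 for value in birth_dates if value.strip() == "")
# Assumption: blank values are excluded from the unique-value count.
unique_values = len({value for value in birth_dates if value.strip() != ""})

print("Number of records:", records)
print("Number of unique values:", unique_values)
print("Number of blanks:", blanks)
```

Note that these counts do not flag 2/30/1968 as a problem. Profiling only summarizes the column, and the impossible date is caught later when the data is assessed against a validity rule.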

Further explanation of the data quality assessment process that follows data profiling can
be seen in the video “Use Case: Employee Data”, which covers data assessment against data quality
rules, data issue resolution, and corrective measures to prevent further issues.
**YouTube video:
https://www.youtube.com/watch?v=kDOelMaTOuM&list=PLmEqVh8_i9736qJR_zTl1sRQ9rYI8ii3h&index=5
(11:29 – 16:04)

SELF-LEARNING ACTIVITY
Perform data profiling and assess the relevant data quality dimensions according to the DAMA
framework.

SUMMARY
As you have learnt in this topic, one of the common data quality frameworks is the DAMA framework,
which specifies six key dimensions to assess data quality. There are also other frameworks with different
key dimensions, such as the ISO 8000, IBM, and GS1 frameworks. We can perform a data quality
process based on any of these frameworks. Data profiling is the key step in a data quality process that
assists in identifying data problems.

KEYWORDS
data quality, data quality dimensions, DAMA framework, data profiling

REFERENCES
[1] Black, A. & van Nederpelt, P. (2020). Dimensions of Data Quality. Retrieved from
http://www.dama-nl.org/wp-content/uploads/2020/09/DDQ-Dimensions-of-Data-Quality-Research-Paper-version-1.2-d.d.-3-Sept-2020.pdf
