SIM - Chapters - DA T3
LEARNING OUTCOMES
At the end of this topic, you should be able to:
• Explain the data quality dimensions of a data quality framework
• Perform data profiling on a data set
INTRODUCTION
Data quality is one of the most important problems in data management. Real-life data are often dirty:
inconsistent, duplicated, inaccurate, or incomplete. Dirty data can arise from errors made when merging
data sets from different sources, broken rules, data entry errors, inconsistent data sources, and
data transmission errors. If these errors are not corrected, the data will be of poor quality, which in
turn leads to poor decisions and results.
Data quality refers to a measure, or set of measures, that gives an organization an indication of the level
of confidence it can have in the data used in its operational and strategic decision-making
processes. Data quality management is the set of processes by which we manipulate the organization's data
to increase its quality. In this topic, you will learn about the key data quality dimensions in the Data
Management Association (DAMA) framework and the data profiling process.
Figure 1: Key data quality dimensions
• Completeness: A measure of the presence of non-blank values (the absence of missing values).
  E.g. 98% of the data contains the first emergency contact phone number.
• Uniqueness: Analysis of the number of things as assessed in the real world compared to the number of
  things in the data set. E.g. Out of the 520 student records, 96.2% of the data is recorded only once.
• Timeliness: Measure of the time difference. E.g. An emergency contact change which is effective on
  June 1st is entered into the system on June 4th, a delay of 3 days.
• Validity: Data conforms to the syntax (format, type, range) of its definition.
• Accuracy: Data represents real-world values.
• Consistency: The absence of difference when comparing two or more representations of a thing.
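The completeness, uniqueness and timeliness measures in Figure 1 reduce to simple ratios and date arithmetic. The following is a minimal sketch, assuming records are held as Python dictionaries; the field names, the sample values and the year 2024 are hypothetical and are not taken from the course material.

```python
from datetime import date

def completeness(records, field):
    """Percentage of records whose field is present and non-blank."""
    filled = sum(
        1 for r in records
        if r.get(field) is not None and str(r.get(field)).strip() != ""
    )
    return 100.0 * filled / len(records)

def uniqueness(records, key):
    """Percentage of distinct key values relative to the number of records."""
    distinct = len({r[key] for r in records})
    return 100.0 * distinct / len(records)

def timeliness_delay(effective_on, entered_on):
    """Delay in days between a change taking effect and being recorded."""
    return (entered_on - effective_on).days

# Hypothetical student records: one blank phone number, one duplicated ID.
records = [
    {"student_id": "S001", "emergency_phone": "555-0101"},
    {"student_id": "S002", "emergency_phone": ""},
    {"student_id": "S003", "emergency_phone": "555-0103"},
    {"student_id": "S003", "emergency_phone": "555-0103"},
]

completeness_pct = completeness(records, "emergency_phone")
uniqueness_pct = uniqueness(records, "student_id")
# June 1st to June 4th, as in the timeliness example (the year is assumed).
delay_days = timeliness_delay(date(2024, 6, 1), date(2024, 6, 4))
```

Here uniqueness is taken as distinct keys over total records, which is one reasonable reading of the definition; other definitions, such as the share of records that occur exactly once, are equally possible.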
A data variable must conform to one or more of the data quality dimensions in Figure 1, but not
necessarily all of them. For example, for the date of birth we can define the following data quality rules:
• Validity: The birth date must be a valid date in the range from 1900 to the current date.
• Completeness: The birth date must be entered for each individual; empty fields are not allowed.
From the example, we can see that each data quality rule is associated with a particular data quality
dimension. Multiple data quality rules can be associated with one data quality dimension.
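The two birth date rules can be expressed as a small check that reports which dimension a value violates. This is a minimal sketch under stated assumptions: the function name `check_birth_date` is hypothetical, and birth dates are assumed to arrive as text in `YYYY-MM-DD` format, which the course material does not specify.

```python
from datetime import date, datetime

def check_birth_date(value, today=None):
    """Return the list of violated dimensions for one birth date value.

    Assumes `value` is a string in YYYY-MM-DD format, or None.
    """
    today = today or date.today()

    # Completeness rule: the field must not be empty.
    if value is None or not value.strip():
        return ["completeness"]

    # Validity rule, part 1: the text must parse as a real calendar date.
    try:
        parsed = datetime.strptime(value.strip(), "%Y-%m-%d").date()
    except ValueError:
        return ["validity"]

    # Validity rule, part 2: the date must lie between 1900 and today.
    if not (date(1900, 1, 1) <= parsed <= today):
        return ["validity"]

    return []

# Example usage with a fixed "today" so the results are reproducible.
reference_day = date(2024, 1, 1)
results = {
    v: check_birth_date(v, today=reference_day)
    for v in ["1985-07-14", "", "1990-02-30", "1899-12-31", "2999-01-01"]
}
```

Returning the dimension name rather than a plain pass or fail keeps each rule tied to its dimension, so several rules for one dimension can be added later without changing the caller.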
SELF-LEARNING ACTIVITY
DAMA is one of the common data quality frameworks used for data quality management. There
are also other data quality frameworks, such as ISO 8000, IBM, and GS1.
Compare the DAMA, ISO 8000, IBM, and GS1 frameworks in terms of their data quality dimensions.
Write your answers in the forum.
The activities in data profiling include collecting descriptive statistics (e.g., min, max, count, sum),
collecting data types, lengths and recurring patterns, tagging data with keywords, descriptions or
categories, and performing data quality assessment.
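The statistics, type, length and pattern collection described above can be sketched for a single column as follows. This is an illustrative example using only the Python standard library; the function names and the sample values are hypothetical, and a real profiling tool would collect considerably more.

```python
import re
from collections import Counter

def pattern_of(text):
    """Reduce a value to its shape: letters become 'A', digits become '9'."""
    return re.sub(r"[0-9]", "9", re.sub(r"[A-Za-z]", "A", text))

def profile_column(values):
    """Collect basic profiling facts for one column of raw values."""
    present = [v for v in values if v is not None and str(v).strip() != ""]
    numeric = [
        v for v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    profile = {
        "count": len(values),
        "missing": len(values) - len(present),
        "types": sorted({type(v).__name__ for v in present}),
        "max_length": max((len(str(v)) for v in present), default=0),
        "patterns": Counter(pattern_of(str(v)) for v in present).most_common(3),
    }
    # min, max and sum are only meaningful for numeric values.
    if numeric:
        profile.update(min=min(numeric), max=max(numeric), sum=sum(numeric))
    return profile

# Hypothetical columns: a numeric salary column and a coded identifier column.
salary_profile = profile_column([3200, 4100, None, 2800])
code_profile = profile_column(["EMP-001", "EMP-002", "E17", ""])
```

The pattern counts make format outliers easy to spot: a value whose shape differs from the dominant pattern is a candidate validity issue.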
Data profiling allows you to answer the following questions about your data:
• Does the available data represent a complete picture of the data that should be present?
• Does the data conform to the structure you would expect when you observe it?
• If you have the same data in two different systems, do they hold the same values?
• If some properties of the data should be unique, does the data set reflect that?
• Is the data present accurate?
The above questions are some of the questions that arise when we perform data quality assessment
based on the key dimensions explained in Section 1.1.
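The question about the same data held in two different systems can be checked mechanically. The sketch below assumes each system exposes its records as a dictionary keyed by a shared identifier; the system names, field names and values are hypothetical.

```python
def compare_systems(system_a, system_b):
    """Compare two {key: record} mappings and report where they disagree."""
    keys_a, keys_b = set(system_a), set(system_b)
    mismatches = []
    for key in sorted(keys_a & keys_b):
        rec_a, rec_b = system_a[key], system_b[key]
        for field in sorted(set(rec_a) | set(rec_b)):
            if rec_a.get(field) != rec_b.get(field):
                mismatches.append((key, field, rec_a.get(field), rec_b.get(field)))
    return {
        "only_in_a": sorted(keys_a - keys_b),
        "only_in_b": sorted(keys_b - keys_a),
        "mismatches": mismatches,
    }

# Hypothetical extracts of the same employees from two systems.
hr = {
    "E001": {"name": "Aisha Rahman", "department": "Finance"},
    "E002": {"name": "Ben Lee", "department": "IT"},
    "E003": {"name": "Chitra Nair", "department": "Sales"},
}
payroll = {
    "E001": {"name": "Aisha Rahman", "department": "Finance"},
    "E002": {"name": "Ben Lee", "department": "Operations"},
    "E004": {"name": "Daniel Tan", "department": "IT"},
}

report = compare_systems(hr, payroll)
```

A mismatch only shows that the two representations differ; deciding which system holds the accurate value is a separate accuracy question that needs a trusted reference.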
To understand data profiling, we are going to look at an example of Employee data as follows:
The data quality assessment process that follows data profiling is explained further in the video
“Use Case: Employee Data”, which covers data assessment against data quality rules, data issue
resolution, and corrective measures to prevent further issues.
YouTube video:
https://www.youtube.com/watch?v=kDOelMaTOuM&list=PLmEqVh8_i9736qJR_zTl1sRQ9rYI8ii3h&index=5
(11:29 – 16:04)
SELF-LEARNING ACTIVITY
Perform data profiling and assess the relevant data quality dimensions according to the DAMA
framework.
SUMMARY
As you have learnt in this topic, the DAMA framework is one of the common data quality frameworks;
it specifies six key dimensions for assessing data quality. There are also other frameworks with different
key dimensions, such as the ISO 8000, IBM, and GS1 frameworks. We can perform a data quality
process based on any of these frameworks. Data profiling is the key step in a data quality process; it
assists in identifying data problems.
KEYWORDS
data quality, data quality dimensions, DAMA framework, data profiling
REFERENCES
[1] Black, A. & van Nederpelt, P. (2020). Dimensions of Data Quality. Retrieved from
http://www.dama-nl.org/wp-content/uploads/2020/09/DDQ-Dimensions-of-Data-Quality-Research-Paper-version-1.2-d.d.-3-Sept-2020.pdf