
Task 1

Introduction

The quality of data plays a major role in making sound analytical decisions. Missing data, inconsistent data, duplicate data, and invalid data are a few of the problems that degrade data quality. The process of removing or reducing these errors to increase data quality is called data cleaning or scrubbing. The general life cycle of data is capture, update, transmission, access, archive, restore, deletion, and purge. When considering data quality, we mainly focus on the access stage of this life cycle. Data is considered dirty when the user or application accessing it ends up with a wrong result, or cannot derive a result at all, due to inconsistent data. The sources of dirty data include errors made by humans or machines while entering data, errors during data transmission, and bugs introduced while processing the data.

Data Cleaning: Problems and Current Approaches

Data quality problems are distinguished between single-source and multi-source problems, and between schema-level and instance-level problems. When the overall structure of the data or the representation of its content (the schema) has to be changed, data transformation is used. Inconsistencies and errors in the data itself (particular instances) that are not visible at the schema level are handled by data cleaning.

Single-source problems usually arise from a lack of appropriate constraints specified by the data model or the application. Schema-related problems are therefore handled by specifying and enforcing proper integrity constraints, i.e. a better schema design and stricter integrity control. Instance-specific problems such as errors and inconsistent data can also be reduced by a better schema, as shown in the sketch below.
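Many of these schema-level problems can be ruled out when the schema is defined. A minimal sketch in SQL (the table and column names are illustrative assumptions, not taken from the paper):

-- Hypothetical tables; the constraints reject several kinds of dirty data at entry time.
CREATE TABLE department (
    dept_id INTEGER PRIMARY KEY
);

CREATE TABLE employee (
    emp_id  INTEGER PRIMARY KEY,                     -- duplicate ids rejected
    name    VARCHAR(100) NOT NULL,                   -- missing values rejected
    gender  CHAR(1) CHECK (gender IN ('M', 'F')),    -- illegal values rejected
    dept_id INTEGER REFERENCES department (dept_id)  -- dangling references rejected
);

Each constraint closes off one class of single-source problem: the primary keys prevent duplicates, NOT NULL prevents missing values, CHECK restricts the attribute domain, and the foreign key enforces referential integrity.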

The problems described for single sources become more serious when multiple sources are integrated, since any or all of the sources may contain dirty data. Schema-level conflicts arise, such as naming conflicts (the same name used for different objects, and vice versa) and structural conflicts. Instance-level conflicts cause further complications: duplicate data, contradicting records, the same data being represented or interpreted differently, and information provided at different aggregation levels. Handling multi-source problems involves identifying overlapping data from different sources, object identification (matching objects that represent the same real-world entity), and duplicate elimination (merge/purge).
In the figure from the paper (not reproduced here), two sources, Customer and Client, are integrated into a single table, and they exhibit both schema and data conflicts. At the schema level there are naming conflicts (CID/CNO, Customer/Client, Sex/Gender) and a structural conflict: the two tables represent name and address differently. At the instance level, gender is represented differently, as M/F in one source and as 0/1 in the other. In the combined table, CID and CNO are each given a separate column since both are source-specific identifiers, gender is uniformly represented as M/F, and name and address are split into individual components so that the values from both tables can be represented.
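A minimal sketch of how this integration could be expressed in SQL, assuming (as an illustration, not from the paper) that Customer stores gender as M/F with a single Name field, while Client stores gender as 0/1 with separate name fields:

-- Sketch: map both sources into a unified result with source-specific ids,
-- gender normalized to M/F, and the name split into components.
SELECT CID,
       CAST(NULL AS INTEGER) AS CNO,
       SUBSTRING(Name FROM 1 FOR POSITION(' ' IN Name) - 1) AS FirstName,  -- naive split at first space
       SUBSTRING(Name FROM POSITION(' ' IN Name) + 1)       AS LastName,
       Sex AS Gender                                                       -- already M/F
FROM   Customer
UNION ALL
SELECT CAST(NULL AS INTEGER) AS CID,
       CNO,
       FirstName,
       LastName,
       CASE Gender WHEN '0' THEN 'F' WHEN '1' THEN 'M' END AS Gender       -- normalize 0/1 to M/F
FROM   Client;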
A Taxonomy of Dirty Data

The paper uses a successive refinement approach to build a comprehensive taxonomy of dirty data. The taxonomy is represented as a hierarchy in which nodes are successively broken down until the leaf nodes are intuitively obvious and no further breakdown is possible. Dirty data manifests itself in three forms: missing data, not-missing but wrong data, and not-missing, not-wrong but unusable data. The third form mostly arises when two or more data sources are integrated. The taxonomy contains 33 leaf nodes, each a primitive dirty data type (the full hierarchy is given in the paper's figure, not reproduced here).

To form a hierarchical view of dirty data, it is first categorized into two subgroups, missing data and non-missing data. The non-missing data is further classified into wrong data and not-wrong but unusable data.
1. Missing data
Missing data (1.1) represents data that are unknown and are allowed to be null, whereas (1.2) in the same category represents data that are unknown even though they are not allowed to be null.

2. Not missing data

Not-missing data is further divided into wrong data (2.1) and not-wrong but unusable data (2.2). Wrong data (2.1) yields a wrong result whenever it is accessed, while not-wrong but unusable data (2.2) is correct in itself yet still leads to wrong results during analysis or querying.

Such wrong or unusable data mainly occurs when an entity has two different values for the same field across different databases, when the data is incomprehensible because of non-standard abbreviations, when a single field mixes different types of data, when the data is represented differently (e.g. in encoded form), when units are misrepresented, and so on. The graphical representation of this sub-category is depicted in the paper's figure (not reproduced here).
Conclusion

The first paper describes the sources of dirty data as single-source and multi-source, where single-source problems arise from bad schema design and multi-source problems arise mostly from conflicting data across two or more sources. The second paper, in contrast, divides the sources of dirty data into two categories, missing and non-missing data, and then refines them into clear and understandable sub-types based on the reasons for their occurrence. The second paper also presents suggestions on how to prevent dirty data from being collected in the first place. Its only shortcoming is that it merely proposes ways to prevent the collection of dirty data and does not deal with dirty data that has already been collected. Both papers identify sources of dirty data and measures for collecting clean data, but the main distinction between them is that the first paper explains the contributing factors only superficially, whereas the second paper explains each factor causing dirty data in detail and provides solutions one by one.
Task 2

The problems with the dataset are as follows:

● The dept_id column has a missing value for employee_id = 178, and the dept_id value 95 is also invalid, as no department with dept_id 95 exists.

● The manager_id column has missing/null values, the first name and last name columns contain misspelled data, and the email column contains duplicate values.

● The hiredate column is represented differently for different records, i.e. dd-mm-yy for a few records and dd-mm-yyyy or dd/mm/yyyy for others.

● The commission_pct column has null values, and the value is stored as a whole number for a few records and as a fraction for others.

These problems can be surfaced with a few diagnostic queries, as sketched below.
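The following is a sketch of such queries; the table names employees and departments and the column names are assumed from the assignment dataset and may differ in the actual schema:

-- Missing department assignment.
SELECT employee_id FROM employees WHERE dept_id IS NULL;

-- dept_id values that reference no existing department (e.g. 95).
SELECT e.employee_id, e.dept_id
FROM   employees e
       LEFT JOIN departments d ON d.dept_id = e.dept_id
WHERE  e.dept_id IS NOT NULL AND d.dept_id IS NULL;

-- Duplicate email addresses.
SELECT email, COUNT(*) AS occurrences
FROM   employees
GROUP  BY email
HAVING COUNT(*) > 1;

-- commission_pct stored as a whole number instead of a fraction.
SELECT employee_id, commission_pct
FROM   employees
WHERE  commission_pct > 1;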

Handling the issues

● The missing-data error in dept_id can be handled with a constraint that disallows null values, and the not-missing but wrong value can be handled by enforcing the referential integrity constraints supported by relational databases.

● The missing data in the manager column can legitimately be null, since an employee cannot be their own manager; where a value is required, it can be filled in with representative data or through intervention by a domain expert.

● The errors in commission_pct can be handled by checking the data type or by data profiling together with a domain expert.

● The inconsistency in the dates can be handled by setting up triggers or through intervention by a domain expert.

A sketch of the corresponding constraints and clean-up statements follows.
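These statements use the same assumed table and column names as above; the exact ALTER TABLE syntax varies by DBMS:

-- Prevent the problems from being reintroduced
-- (run after the missing value for employee_id = 178 has been filled in).
ALTER TABLE employees ALTER COLUMN dept_id SET NOT NULL;   -- reject missing dept_id (PostgreSQL syntax)
ALTER TABLE employees
  ADD CONSTRAINT fk_emp_dept FOREIGN KEY (dept_id)
      REFERENCES departments (dept_id);                    -- reject invalid ids such as 95

-- Normalize commission_pct so every row stores a fraction.
UPDATE employees
SET    commission_pct = commission_pct / 100
WHERE  commission_pct > 1;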
Sample Data

Missing Data

Misspelled Data
Duplicate Data

Inconsistent Data

Invalid Data, i.e. Missing Foreign Key Reference


Importing Employee and Department
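The import screenshots are not reproduced here. As a hedged sketch, the department and employee data could be loaded with PostgreSQL's COPY command (the file paths are placeholders):

-- Assumed file locations; COPY is PostgreSQL-specific.
COPY departments FROM '/data/departments.csv' WITH (FORMAT csv, HEADER true);
COPY employees   FROM '/data/employees.csv'   WITH (FORMAT csv, HEADER true);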
SQL Statements and Results
Department Query
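The query screenshots are likewise not reproduced. A department-level query over this dataset might look like the following sketch (dept_name is an assumed column):

-- Employees per department, including departments with no employees.
SELECT d.dept_id, d.dept_name, COUNT(e.employee_id) AS num_employees
FROM   departments d
       LEFT JOIN employees e ON e.dept_id = d.dept_id
GROUP  BY d.dept_id, d.dept_name
ORDER  BY d.dept_id;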
