
The ‘What’s and ‘Why’s of Data Cleaning

CSD102 - Session 5
Data for Decisions
 As we discussed last week, the main objective of Data Science is to enable us to make data-driven decisions.

 Data-driven decisions are reliable only if the data is reliable.

 Data is collected from different sources, using different methods, by different people.

 This diversity of sources, methods, and people may lead to the collection of unclean data.

 The resulting anomalies can harm the quality of the data, and ultimately the quality of the decisions taken from it.
Data Quality
 Data Specifications: Whenever we collect data, we want to arrange it in a specific format and order. While registering data, the entry operator might not follow these specifications, which degrades data quality.

 Errors: While registering data, entry errors can creep in.

 Encoding of data: To analyse data, we need to code it with numbers. Similarly, for a large dataset we may need to distribute the data across separate data tables. This requires assigning keys to columns / fields. Assigning keys, maintaining their connectivity across different data tables, and fetching records easily using those keys can be a challenge, which may adversely affect data quality.

 Data Integrity: As mentioned above, a large dataset might be distributed across separate tables, and the key connectivity is the only basis for maintaining data integrity. Duplication, omission, or format violation will result in loss of data integrity.

 In the following slides we will discuss all these topics in detail.


Data Specifications
 Data requirement in a specific format:

 Number / currency formats: The data may contain different currency or number formats (e.g., ₤ 1,000,000 vs 1,00,000 vs 100000.00, or 1.23E+09 vs 1234320000 vs 1,234,320,000). We need to convert the values into the required currency or number format.

 Date formats: Different date formats in one dataset also create confusion (e.g., 1-Sep-2020 vs 09-01-2020 vs 01-Sep).

 Text / string length and conventions (capital letters, punctuation, delimiters like ‘;’, ‘/’, ‘|’ etc.)

 Order of data: Unordered data makes no sense for any analysis. It is important to arrange the data in order of a leading key in the dataset (e.g., Name, Enrolment, or Date of Joining).

 Concatenation requirement (joining strings): Sometimes, if data is split across different fields / columns, it cannot be used for some text analyses. In such cases we need to merge the fields into a single column (e.g., First Name + Last Name → Full Name).
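Much of this normalisation can be scripted rather than done by hand. Below is a minimal sketch in Python with pandas, assuming the data is already loaded into a DataFrame; the column names (amount, joined, first_name, last_name) are hypothetical, not taken from the slides.

```python
import pandas as pd

# Hypothetical toy data showing the format problems described above.
df = pd.DataFrame({
    "amount":     ["₤ 1,000,000", "1,00,000", "100000.00"],
    "joined":     ["1-Sep-2020", "2020-09-02", "03 September 2020"],
    "first_name": ["Asha", "Ravi ", "Meera"],
    "last_name":  ["Rao", "Kumar", "Iyer"],
})

# Number / currency formats: strip symbols and separators, then cast to float.
df["amount"] = df["amount"].str.replace(r"[^\d.]", "", regex=True).astype(float)

# Date formats: parse mixed formats into one canonical datetime type.
# format="mixed" needs pandas >= 2.0; errors="coerce" turns unparseable
# entries into NaT instead of raising.
df["joined"] = pd.to_datetime(df["joined"], format="mixed", errors="coerce")

# Concatenation requirement: merge the split name fields into one column.
df["full_name"] = df["first_name"].str.strip() + " " + df["last_name"].str.strip()

# Order of data: sort by a leading key so the records are ready for analysis.
df = df.sort_values("joined")
print(df)
```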
Errors
 Typographical Errors: Typographical errors occur frequently in large datasets. When a team of researchers collects data separately and then compiles it, the data is more prone to typographical errors. Similarly, if data entry is assigned to a separate team, errors can arise when they do not understand the keywords or jargon.

 Missing Values: This is another frequently occurring error when collecting a large pool of data, and missing values can lead to loss of data integrity.

 Duplication: Duplication is a serious error for distributed data. A duplicated key will disconnect the data in two tables.

 White Spaces: This is a commonly occurring error. While entering data in software like MS Excel®, the entry operator may, by mistake, add extra white spaces before, between, and/or after the text. When we attempt to fetch the records, the extra spaces may leave us with an incorrect number of records.
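These entry errors can also be detected programmatically. Here is a minimal sketch in Python with pandas; the students table and its columns are invented for illustration.

```python
import pandas as pd

# Hypothetical data containing the error types listed above.
students = pd.DataFrame({
    "enrolment": ["S001", "S002", "S002", "S003"],
    "name":      ["  Asha Rao", "Ravi  Kumar ", "Ravi Kumar", None],
})

# White spaces: trim the ends and collapse internal runs of spaces.
students["name"] = (students["name"]
                    .str.strip()
                    .str.replace(r"\s+", " ", regex=True))

# Missing values: count them per column before deciding to fill or drop.
print(students.isna().sum())

# Duplication: duplicated keys break links between tables, so surface them.
print(students[students.duplicated(subset="enrolment", keep=False)])
```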
Encoding Complexity or Errors
 Data lying in multiple tables / sheets / files: Continuing from the last slide, a large collection of data needs to be distributed across different tables. This makes data entry easier.

 Complex links between data tuples: When we distribute data across separate tables, we have to closely read the keys in each record and find the connected record in another table to understand the information registered, which may become complex.

 Fetching difficulties: When a piece of information is registered in more than two tables, with codes, fetching it is challenging.

 Example: IPL Data Worksheet. The IPL data is divided into 6 files. The main file contains data for every delivery bowled in 10 seasons. This makes the data difficult to read:
   The file is too large to take in at a glance.
   Every player, match, team, and season is coded.
   Every time we want to understand a complete piece of information, we have to look at all of these files.
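Resolving such codes by hand across files is exactly what table joins automate. Below is a minimal sketch in Python with pandas; the frames and column names (deliveries, players, batsman_id, player_id) are assumptions for illustration, not the actual layout of the IPL worksheets.

```python
import pandas as pd

# Hypothetical stand-ins for two of the distributed files.
deliveries = pd.DataFrame({"match_id": [1, 1, 2], "batsman_id": [10, 11, 10]})
players    = pd.DataFrame({"player_id": [10, 11],
                           "player_name": ["Player A", "Player B"]})

# Resolve the coded batsman into a readable name by joining on the key,
# instead of looking each code up manually across files.
decoded = deliveries.merge(players, left_on="batsman_id",
                           right_on="player_id", how="left")
print(decoded[["match_id", "player_name"]])
```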
Data Integrity
 Coding errors: In the IPL example, the teams, seasons, players, and matches are coded with unique numbers. Any error in writing the code for a match, team, or any other field will violate data integrity, and fetching the complete linked record will be impossible.

 Primary and Foreign key issues: These are the keys that identify an entity uniquely. For example, a player’s ID in the player dataset is a primary key, by which we can identify him in the IPL main file, where the same ID is known as a foreign key. Player IDs should never repeat in the player dataset; otherwise, we will never be able to tell which player ID refers to which player.

 Integrity loss due to data compilation from different files: Managing different files, as in the IPL data, is a difficult task. If the data is edited very frequently, it is more prone to errors.

 Integrity loss due to data collection by different people: When multiple operators edit data in a common dataset, the probability of errors increases.
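Both key problems can be checked mechanically before analysis. A minimal sketch in Python with pandas, with invented frames standing in for the player table and the main file:

```python
import pandas as pd

# Hypothetical tables: player_id 11 repeats, and batsman_id 99 has no owner.
players    = pd.DataFrame({"player_id": [10, 11, 11]})
deliveries = pd.DataFrame({"batsman_id": [10, 11, 99]})

# Primary key check: a player's ID must never repeat in the player dataset.
print(players[players["player_id"].duplicated(keep=False)])

# Foreign key check: every batsman_id must exist in the player dataset;
# rows that do not are orphan records that break the link between tables.
print(deliveries[~deliveries["batsman_id"].isin(players["player_id"])])
```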
Next Session
 In our next session, we will learn some techniques to clean data in MS Excel® for common entry errors. These may include:
 Removing white spaces
 Concatenating or splitting data
 ‘Find-and-replace’ for quick updates
 Fetching data from different tables without errors, using Excel functions
 Avoiding integrity violations for foreign keys
