What's and Why's of Data Cleaning: CSD102 - Session 5
Data for Decisions
As we discussed last week, the main objective of data science is to enable us to take data-driven decisions.
Data is collected from different sources, using different methods, by different people.
This diversity of sources, methods, and people may lead to the collection of unclean data.
The resulting anomalies can harm the quality of the data, and eventually the quality of the decisions based on that data.
Data Quality
Data Specifications: Whenever we collect data, we want to arrange it in a specific format and order. While entering data, the entry operator might not follow these specifications, which degrades data quality.
Encoding of data: To analyse data, we often need to code it with numbers. Similarly, for a large dataset, we may need to distribute the data across separate tables. This requires assigning keys to columns / fields. Assigning keys, maintaining their connectivity across tables, and fetching records using the keys can be challenging, and mistakes here adversely affect data quality.
Data Integrity: As mentioned above, a large dataset might be distributed across separate tables, and the connectivity between them is the only basis for maintaining data integrity. Duplication, omission, or format violation will result in loss of data integrity.
collect data separately and then compile it, they are more prone to typographical errors. Similarly, if data entry is assigned to a separate team, errors can occur when the team does not understand the keywords or jargon.
Missing Values: This is another frequently occurring error when collecting a large pool of data. And missing
Duplication: Duplication is a serious error for distributed data. A duplicated key will break the connection between the data in two tables.
White Spaces: This is a commonly occurring error. While entering data in software such as MS Excel®, the entry operator may, by mistake, add extra white spaces before, between, and/or after the text. When we attempt to fetch records, the extra spaces may cause us to end up with an incorrect number of records.
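The white-space and missing-value problems described above can be illustrated with a short Python sketch. The records, the field names (`player`, `team`), and the helpers `clean_text` and `count_matches` are hypothetical, invented only for illustration; this is a minimal sketch under those assumptions, not a prescribed cleaning procedure.

```python
# Hypothetical records, as an entry operator might have typed them.
records = [
    {"player": "Ravi Kumar", "team": "Team X"},
    {"player": "  Ravi Kumar ", "team": "Team X"},   # leading and trailing spaces
    {"player": "Ravi  Kumar", "team": " Team X"},    # double space inside the name
    {"player": "Anil Singh", "team": ""},            # team was never filled in
]


def clean_text(value):
    """Trim both ends and collapse internal runs of white space to one space."""
    return " ".join(value.split())


def count_matches(rows, name):
    """Count rows whose 'player' field equals name exactly."""
    return sum(1 for row in rows if row["player"] == name)


# Exact matching on the raw data misses the rows with stray spaces.
raw_count = count_matches(records, "Ravi Kumar")

# Clean every field of every row, then match again.
cleaned = [{key: clean_text(val) for key, val in row.items()} for row in records]
clean_count = count_matches(cleaned, "Ravi Kumar")

# Row positions where at least one field is empty after cleaning.
missing = [i for i, row in enumerate(cleaned) if any(v == "" for v in row.values())]

print("Matches before cleaning:", raw_count)
print("Matches after cleaning:", clean_count)
print("Rows with missing values:", missing)
```

Before cleaning, an exact match finds only one of the three rows for the same player, which is the "incorrect number of records" problem. The `clean_text` helper is similar in spirit to Excel's TRIM function, which also removes leading and trailing spaces and reduces repeated internal spaces to one.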
Encoding Complexity or Errors
Data lying in multiple tables / sheets / files: Continuing from the last slide, a large collection of data needs to be distributed across different tables. This helps in easy entry of data.
Complex links between data tuples: When we distribute data across separate tables, we have to closely read the keys in each record and find the connected record in another table to understand the information registered, which may become complex.
Fetching difficulties: When a piece of information is registered in more than two tables, with codes, fetching it is challenging.
Example: IPL Data Worksheet
IPL data is divided into 6 files. The main file contains data for every delivery bowled in 10 seasons. There are difficulties in reading the data, such as:
The file is too large to take in at a glance.
Every player, match, team, and season is coded.
Every time we want to understand a complete piece of information, we need to look at all these files.
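To make the fetching difficulty concrete, here is a minimal Python sketch of reading coded records. The tables, column names, and all IDs are hypothetical stand-ins chosen for illustration; the actual IPL files and their layout are not reproduced here.

```python
# Hypothetical lookup tables: each maps a numeric code to a readable name.
players = {1: "Player A", 2: "Player B"}
teams = {10: "Team X", 11: "Team Y"}

# Hypothetical main table: one row per delivery, storing only codes.
deliveries = [
    {"match_id": 501, "batsman_id": 1, "bowler_id": 2, "batting_team_id": 10, "runs": 4},
    {"match_id": 501, "batsman_id": 2, "bowler_id": 1, "batting_team_id": 11, "runs": 0},
]


def describe(delivery):
    """Replace each code in a delivery row with the name it refers to."""
    return {
        "match_id": delivery["match_id"],
        "batsman": players[delivery["batsman_id"]],
        "bowler": players[delivery["bowler_id"]],
        "batting_team": teams[delivery["batting_team_id"]],
        "runs": delivery["runs"],
    }


readable = [describe(d) for d in deliveries]
for row in readable:
    print(row)
```

Every readable field needs a lookup in another table, which is why a single record cannot be understood from the main file alone.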
Data Integrity
Coding errors: In the IPL example, the teams, seasons, players, and matches are coded with unique numbers. Any error in writing the code for a match, team, or any other field will violate data integrity, and fetching the complete chain of linked records will be impossible.
Primary and Foreign key issues: A primary key identifies an entity uniquely within its own table. For example, a player’s ID in the player dataset is a primary key; the same ID, when used in the IPL main file to refer to that player, is known as a foreign key. Player IDs should never repeat in the player dataset; otherwise, we will not be able to tell which player a given ID refers to.
Integrity loss due to data compilation from different files: Managing several files, as with the IPL data, is a difficult task. If the data is edited very frequently, it is more prone to errors.
Integrity loss due to data collection by different people: When multiple operators edit a common dataset, the probability of error increases.
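The two key problems above, a repeated primary key and a foreign key that points to nothing, can be checked mechanically. The following Python sketch assumes a hypothetical player table and a hypothetical slice of the main file; the names `duplicate_keys` and `orphan_rows` and all IDs are illustrative only.

```python
from collections import Counter

# Hypothetical player table as (player_id, name) rows; ID 2 is repeated by mistake.
player_rows = [(1, "Player A"), (2, "Player B"), (2, "Player C"), (3, "Player D")]

# Hypothetical main-file rows that refer to players by ID; ID 9 does not exist.
delivery_rows = [
    {"delivery": 1, "batsman_id": 1},
    {"delivery": 2, "batsman_id": 2},
    {"delivery": 3, "batsman_id": 9},
]


def duplicate_keys(rows):
    """Return primary-key values that occur more than once, in sorted order."""
    counts = Counter(key for key, _ in rows)
    return sorted(key for key, n in counts.items() if n > 1)


def orphan_rows(rows, valid_ids, field):
    """Return rows whose foreign key does not appear in the referenced table."""
    return [row for row in rows if row[field] not in valid_ids]


valid_ids = {key for key, _ in player_rows}
dupes = duplicate_keys(player_rows)
orphans = orphan_rows(delivery_rows, valid_ids, "batsman_id")

print("Duplicate player IDs:", dupes)
print("Deliveries with unknown batsman:", [row["delivery"] for row in orphans])
```

With the repeated ID 2, delivery 2 could refer to either Player B or Player C, and delivery 3 refers to no player at all; both cases break the link between the main file and the player dataset.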
Next Session
In our next session, we will learn some techniques to clean data using MS Excel®, covering some common entry errors. These may include:
Removing white spaces
Concatenating or splitting data
‘Find-and-replace’ for quick updates
Fetching data from different tables without errors, using Excel functions
Avoiding integrity violations for foreign keys